Principal Component Analysis

Let us consider some important questions that arise when working with data containing many variables, as is common in genomics or weather data, where the number of variables often exceeds the number of observations. How can we visualise the data, and which variables are most relevant in explaining the variation within the dataset? A naive approach to the first question is to pick a lower dimension, say two or three, select that many variables from the dataset as they are, and visualise them with a graphical tool. A correlation matrix may accompany this to gauge which variables are highly correlated with each other and to give a sense of which variable affects the others the most.
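As a concrete illustration of that naive approach, here is a minimal sketch using NumPy (assumed available) on a small synthetic dataset; the variables and their relationships are made up purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 50 observations of 4 variables, where the last two
# are noisy combinations of the first two to create strong correlations.
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
X = np.column_stack([x1, x2,
                     x1 + 0.1 * rng.normal(size=50),
                     x1 - x2 + 0.1 * rng.normal(size=50)])

# rowvar=False tells np.corrcoef that each column is a variable.
print(np.round(np.corrcoef(X, rowvar=False), 2))
```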

The limitation of the above is that, in most real-life situations, you would not expect a single variable to influence the others this much; it is more likely that a mixture of the variables drives the variation, and finding such a mixture can be hard, especially among a large number of variables. Principal component analysis (PCA) solves this problem by constructing new vectors, each of which is a weighted sum of the variables in the initial dataset (the elements of the new vectors are the weights given to the original variables). These new vectors are known as the principal components and are found by imposing two fundamental design properties:
  1. The reconstruction error is minimised, i.e. the sum of squared distances between the original data points and their projections onto the new vector is minimised.
  2. The variance of the dataset projected onto the new vectors is maximised. This ensures that as much variance as possible in the dataset is explained by the principal components.
Surprisingly, these two properties coincide in PCA, and the solution is closely tied to singular value decomposition (SVD), which we briefly cover first.
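For orientation before the derivation, here is a minimal sketch of PCA in practice using scikit-learn (assuming it is installed); the data is synthetic and the parameter choices are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))        # 100 observations, 10 variables

pca = PCA(n_components=3)             # keep the first three principal components
scores = pca.fit_transform(X)         # each observation reduced to 3 weighted sums

print(pca.components_.shape)          # (3, 10): each row is a weight vector over the variables
print(pca.explained_variance_ratio_)  # fraction of total variance each component explains
```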

SVD

SVD is a factorisation that applies to matrices of any size: it decomposes a matrix into a rotation, followed by a rescaling, followed by another rotation. If the matrix is not square, the first and last factors change the dimension, depending on whether there are more rows or more columns. WLOG suppose that we have an \( n \times p \) matrix \( X \) with \( n > p \). Denote by \( X^* \) the conjugate transpose of \( X \). Since \( X^*X \) is Hermitian, it is diagonalisable with orthonormal eigenvectors. Thus, we have \( \begin{aligned} X^*X = VDV^{-1} \end{aligned} \) where \( V \) is a unitary \( p \times p \) matrix and \( D \) the corresponding diagonal matrix of eigenvalues \( \lambda_i \) for \( i = 1,\dots, p. \) These eigenvalues are real and non-negative. They are real because of the Hermitian property, and they are non-negative because for any eigenvector \( v \) and corresponding eigenvalue \( \lambda \) of \( X^*X \), we have $$ \begin{aligned} 0 \leq \|Xv \|^2 &= \langle Xv, Xv \rangle \\ &= \langle v, X^* X v \rangle \\ &= \langle v, \lambda v \rangle \\ &= \lambda \langle v, v \rangle \\ &= \lambda \|v \|^2 \end{aligned} $$ and thus \( \lambda \geq 0 \). Let \( r \) be the number of non-zero eigenvalues (the rank of \( X \)) and order the eigenvalues so that \( \lambda_1, \dots, \lambda_r > 0 \). By setting \( u_i = \frac{1}{\sigma_i} Xv_i \) where \( \sigma_i = \sqrt{\lambda_i} \) (if any \( \lambda_i = 0 \), we have \( Xv_i = 0; \) by performing Gram-Schmidt orthogonalisation we can obtain the remaining \( u_i \), with corresponding \( \sigma_i = 0 \)), we note that for any \( 1 \leq i,j \leq r \) $$ \begin{aligned} \langle u_i , u_j \rangle &= \frac{1}{\sigma_i \sigma_j} \langle Xv_i, Xv_j \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \langle v_i, X^* X v_j \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \lambda_j \langle v_i, v_j \rangle \\ &= \delta_{ij} \end{aligned} $$ as the \( v_i \) are orthonormal. Thus, with the \( u_i \) as the columns of an \( n \times n \) matrix \( U \), and $$ S = \begin{bmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_p \\ & \mathbf{0} & \end{bmatrix} $$ (an \( n \times p \) matrix whose top \( p \times p \) block is diagonal and whose remaining rows are zero), we can write $$ XV = US \implies X = USV^* $$ where both \( U, V \) are unitary and \( S \) is almost a diagonal matrix (it is diagonal when \( n = p \)).
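A quick numerical check of this factorisation, assuming NumPy is available; note that np.linalg.svd returns \( V^* \) directly (as Vh) and the singular values as a vector.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))                 # n = 6 > p = 4

# full_matrices=True gives U as 6x6 and Vh as 4x4, with s holding the
# p = 4 singular values (square roots of the eigenvalues of X^*X).
U, s, Vh = np.linalg.svd(X, full_matrices=True)

S = np.zeros((6, 4))
S[:4, :4] = np.diag(s)                      # embed the singular values in an n x p matrix

assert np.allclose(X, U @ S @ Vh)           # X = U S V^*
assert np.allclose(U.T @ U, np.eye(6))      # U unitary (orthogonal in the real case)
assert np.allclose(Vh @ Vh.T, np.eye(4))    # V unitary
```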

PCA

As mentioned earlier, we aim to find a new vector \( \mathbf{w} \) which satisfies properties (1) and (2). These two properties coincide, and the equivalence has a nice geometric visualisation. First, we will find \( \mathbf{w} \) such that (2) holds.

Consider the data matrix \( \mathbf{X} \) of size \( n \times p \) in which each row represents an observation and each column a variable. Further, assume each variable is centred for the purposes of calculating variances and correlations (i.e. subtract \( \begin{bmatrix} \mu_1 & \dots & \mu_p \end{bmatrix} \) from each row, where \( \mu_j = \frac{1}{n}\sum_{i} \mathbf{X}_{ij} \)). Then consider \( \mathbf{Xw} \), where \( \mathbf{w} \) is to be determined. Note that \( \mathbf{Xw} \) collects the scalar projections of the observations onto the single vector \( \mathbf{w} \): each observation is reduced to a single value consisting of a linear combination of the data variables with 'weights' specified by \( \mathbf{w} \). We now want to maximise the variance of this projection, so we maximise the sample variance (recalling that the data are mean-centred) $$ \frac{1}{n-1} (\mathbf{Xw})^T \mathbf{Xw} = \mathbf{w}^T\mathbf{Cw} $$ where \( \mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X} \). Since we only care about the direction of \( \mathbf{w} \), we impose the restriction \( \|\mathbf{w}\| = 1 \) on the weights. This is a typical Lagrange multiplier optimisation problem, which we solve by finding \( \lambda \) such that \( \nabla (\mathbf{w}^T\mathbf{Cw}) - \lambda\nabla (\mathbf{w}^T\mathbf{w}-1) = 0 \). Hence, by differentiating we have $$ \begin{aligned} 2\mathbf{Cw} - 2\lambda\mathbf{w} &= 0 \\ \implies \mathbf{Cw} &= \lambda \mathbf{w} \\ \implies \mathbf{w}^T\mathbf{Cw} &= \lambda \end{aligned} $$ and thus the variance of the projection of the dataset onto \( \mathbf{w} \) is maximised by taking \( \lambda \) to be the largest eigenvalue of \( \mathbf{C} \), with \( \mathbf{w} \) the corresponding eigenvector. It should be emphasised here that \( \mathbf{w} \) represents the weighting of each variable in its contribution to the variance \( \lambda \).
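A minimal sketch of this eigenvalue problem with NumPy on synthetic centred data; the variance of the projection onto the top eigenvector of \( \mathbf{C} \) should match the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                         # centre each variable (column)

C = X.T @ X / (X.shape[0] - 1)                 # sample covariance matrix

# eigh is appropriate since C is symmetric; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]                             # eigenvector of the largest eigenvalue

projected = X @ w                              # one scalar score per observation
print(np.var(projected, ddof=1), eigvals[-1])  # these two values agree
```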

Now, observe that \( \mathbf{C} \) has a diagonalisation given by \( V\left (\frac{1}{n-1} D \right )V^* \), where the principal components are the eigenvectors in \( V \), with variances \( \lambda_i = \frac{\alpha_i}{n-1}\) where the \( \alpha_i \) are the eigenvalues of \( \mathbf{X}^T\mathbf{X} \) in \( D \). With the same \( V \), the SVD of \( \mathbf{X} \) is given by $$ \mathbf{X} = USV^* $$ where \( S \) is constructed as in the SVD section using the singular values \( \sigma_i = \sqrt{\alpha_i} \). Equivalently, we have $$ \mathbf{X}V = US. $$ From this equation we see that the principal component directions are indeed the eigenvectors in \( V \), with the coordinates of the data along them given by \( US. \) For example, the \( i \)th observation has a coordinate (or score) with respect to the \( j \)th principal component given by \( (US)_{ij}. \)
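The link between the two decompositions can be checked numerically, again assuming NumPy: the squared singular values of the centred \( \mathbf{X} \), divided by \( n-1 \), match the eigenvalues of \( \mathbf{C} \), and \( US \) reproduces \( \mathbf{X}V \).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)
C = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)

# Variances of the principal components: sigma_i^2 / (n - 1)
print(np.round(s**2 / (X.shape[0] - 1), 4))
print(np.round(eigvals[::-1], 4))            # same values, in descending order

scores = U * s                               # US: one row of scores per observation
assert np.allclose(scores, X @ Vh.T)         # XV = US
```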

Normal conventions dictate that we order the eigenvalues from largest to smallest when we perform the decompositions of \( \mathbf{C} \text{ and } \mathbf{X} \), but this changes nothing as long as we rearrange the corresponding eigenvectors and \( u_i \) accordingly. Now that we have the principal components, of which the first few should account for the majority of the variance in the data (across many fields and datasets this is generally observed to be the case), it becomes practical to visualise the data by keeping at most three of them. For example, taking the first 3 columns of \( U \) and the top-left \( 3 \times 3 \) block of \( S \) and multiplying them together, we obtain the coordinates of every observation with respect to the first three principal components.
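A short sketch of that truncation in NumPy (synthetic data; np.linalg.svd already returns the singular values in descending order, so no reordering is needed):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
X = X - X.mean(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# First three columns of U scaled by the first three singular values,
# i.e. U[:, :3] @ diag(s[:3]): the top-left 3x3 block of S.
coords_3d = U[:, :3] * s[:3]

print(coords_3d.shape)   # (100, 3): 3-D coordinates for each observation
```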

For (1), we can check that minimising \( \|\mathbf{X} -\mathbf{Xww}^T \|^2 \) leads to the same set of equations as above. Note that we are working with the Frobenius norm here. By applying standard matrix calculus and using the cyclic property of the trace, \( \text{Tr}(ab) = \text{Tr}(ba) \) for any conformable matrices \( a,b \), we have $$ \begin{aligned} A(\mathbf{w}) = \|\mathbf{X} -\mathbf{Xww}^T \|^2 &= \text{Tr}((\mathbf{X} -\mathbf{Xww}^T)^T(\mathbf{X} -\mathbf{Xww}^T)) \\ &= \text{Tr}(\mathbf{X}^T\mathbf{X}) - 2\text{Tr}(\mathbf{X}^T\mathbf{Xww}^T) + \text{Tr}(\mathbf{ww}^T\mathbf{X}^T\mathbf{Xww}^T) \\ &= \text{Tr}(\mathbf{X}^T\mathbf{X}) - 2\text{Tr}(\mathbf{w}^T\mathbf{X}^T\mathbf{Xw}) + \text{Tr}(\mathbf{w}^T\mathbf{X}^T\mathbf{Xw}\,\mathbf{w}^T\mathbf{w}) \\ &= \text{Tr}(\mathbf{X}^T\mathbf{X}) - \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} \end{aligned} $$ as \( \mathbf{w}^T\mathbf{w} = 1 \) and \( \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} \) is a scalar. Now, by setting up the Lagrange multiplier optimisation problem as before, we have $$ \begin{aligned} \nabla A(\mathbf{w}) + \lambda \nabla (\mathbf{w}^T\mathbf{w}-1) &= 0 \\ \implies -2\mathbf{X}^T\mathbf{Xw} + 2\lambda \mathbf{w} &= 0 \\ \implies \mathbf{X}^T\mathbf{Xw} &= \lambda \mathbf{w} \end{aligned} $$ and thus we obtain the same eigenvalue problem as before, since \( \mathbf{X}^T\mathbf{X} \) and \( \mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X} \) share the same eigenvectors. Minimising \( A(\mathbf{w}) \) means maximising \( \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} = \lambda \), so we again choose the eigenvector of the largest eigenvalue. Thus (1) and (2) coincide in PCA.
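A small numerical sanity check of this equivalence, on synthetic data with NumPy: the top eigenvector of \( \mathbf{X}^T\mathbf{X} \), which maximises the projected variance, also achieves a lower reconstruction error than any of a batch of random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 6))
X = X - X.mean(axis=0)

def reconstruction_error(w):
    """Squared Frobenius norm of X - X w w^T for a unit vector w."""
    return np.linalg.norm(X - X @ np.outer(w, w)) ** 2

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w_best = eigvecs[:, -1]                     # the variance-maximising direction

random_w = rng.normal(size=(20, 6))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)

print(reconstruction_error(w_best))
print(min(reconstruction_error(w) for w in random_w))   # never beats w_best
```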

We talked about visualisation by reducing the dimension, but one might object that the principal components are still vectors of dimension \( p \). This is easily resolved: most of the variance is explained by the first few principal components (with at most three visualisable on a single graph), and these are orthogonal to each other. There is therefore a natural vector space isomorphism mapping the span of the chosen principal components to the lower-dimensional space spanned by the standard basis vectors, and we can visualise the observations there using their coordinates from \( US \). Note that this isomorphism is an isometry.
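The isometry claim can also be checked numerically (NumPy assumed, synthetic data): pairwise distances between the low-dimensional score vectors equal the distances between the corresponding projected points in \( \mathbb{R}^p \).

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)
k = 2
scores = U[:, :k] * s[:k]          # coordinates w.r.t. the first k principal components
projected = scores @ Vh[:k, :]     # the same points expressed back in R^p

# Distances between observations agree in both representations.
d_scores = np.linalg.norm(scores[0] - scores[1])
d_proj = np.linalg.norm(projected[0] - projected[1])
assert np.isclose(d_scores, d_proj)
```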