PCA

The idea of PCA is to find a new basis for the same space such that the variance of the data is maximized along the new basis vectors. These basis vectors are called principal components. The first principal component is the direction along which the variance of the data is maximized. The second principal component is the direction, orthogonal to the first, along which the variance is maximized. The third principal component is the direction orthogonal to the first two along which the variance is maximized. And so on.

Once we have all the basis vectors, we can project the data onto the first few principal components to reduce the dimensionality of the data. The variance captured along each principal component is equal to the eigenvalue of the corresponding eigenvector, as derived below.
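
As a rough sketch of the whole procedure in NumPy (illustrative code only, assuming the data matrix `X` has one feature per row and one data point per column, which is the convention used in the derivation below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 features, 200 points, stored as d x n (features x points).
X = rng.normal(size=(3, 200))
X[1] += 0.8 * X[0]                         # add correlation so the PCs are interesting

# 1. Mean-center each feature (each row).
X = X - X.mean(axis=1, keepdims=True)

# 2. Covariance matrix, d x d.
n = X.shape[1]
S = (X @ X.T) / n

# 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first k principal components (dimensionality reduction).
k = 2
U_k = eigvecs[:, :k]                       # d x k matrix of principal components
Z = U_k.T @ X                              # k x n coefficients in the new basis

print("variance captured:", eigvals[:k].sum() / eigvals.sum())
```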

We want to find the direction \(u_{1}\) along which the variance of the projected data is maximized. This is the following optimization problem:

\[\underset{u_{1}^{T}u_{1}=1}{\max} \; Var(u_{1}^{T} X)\]

Given that \(X\) is mean centered with respect to each feature, we can expand the variance as follows:

\[Var(u_{1}^{T} X) = \frac{1}{n}(u_{1}^{T} X)(u_{1}^{T} X)^{T}\] \[Var(u_{1}^{T} X) = u_{1}^{T}\left(\frac{1}{n}XX^{T}\right)u_{1} = u_{1}^{T}Su_{1}\]

Here \(S = \frac{1}{n}XX^{T}\) is the covariance matrix of \(X\). The point to note is that here the matrix \(X\) is of size \(d \times n\), where \(d\) is the number of features and \(n\) is the number of data points, so the covariance matrix \(S\) is of size \(d \times d\).
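
A quick numerical sanity check of the identity \(Var(u_{1}^{T}X) = u_{1}^{T}Su_{1}\) under this \(d \times n\) convention (just a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 1000
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)   # mean-center each feature

S = (X @ X.T) / n                       # d x d covariance matrix

u = rng.normal(size=d)
u = u / np.linalg.norm(u)               # random unit-norm direction

# Variance of the projected data equals the quadratic form u^T S u.
proj = u @ X                            # projection of each point onto u
print(np.var(proj))                     # population variance (divides by n)
print(u @ S @ u)                        # same number, up to floating point
```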

Now we want to find the \(u_{1}\) that maximizes this variance, subject to the unit-norm constraint \(u_{1}^{T}u_{1} = 1\). We can use a Lagrangian formulation to solve this constrained problem:

\[L(u_{1},\lambda) = u_{1}^{T}Su_{1} - \lambda(u_{1}^{T} u_{1} - 1)\]

We take the partial derivative of the Lagrangian with respect to \(u_{1}\), which gives \(\frac{\partial L}{\partial u_{1}} = 2Su_{1} - 2\lambda u_{1}\), and equate it to 0. We get the following equation:

\[Su_{1} = \lambda u_{1}\]

This is an eigenvalue equation: the stationary points \(u_{1}\) are eigenvectors of the covariance matrix \(S\), and the Lagrange multipliers \(\lambda\) are the corresponding eigenvalues.

How does variance relate to eigenvalues? We can expand the variance as follows:

\[Var(u_{1}^{T} X) = u_{1}^{T}Su_{1} = u_{1}^{T}(\lambda_{1} u_{1}) = \lambda_{1} u_{1}^{T}u_{1} = \lambda_{1}\]

So the variance along \(u_{1}\) is exactly the eigenvalue \(\lambda_{1}\), and maximizing the variance means picking the eigenvector with the largest eigenvalue, then the next largest, and so on.
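
We can also verify numerically that the variance along the top eigenvector equals the largest eigenvalue (again a rough sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 5000
X = rng.normal(size=(d, n))
X[2] += 1.5 * X[0]                      # correlate two features
X = X - X.mean(axis=1, keepdims=True)   # mean-center each feature

S = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
u1 = eigvecs[:, -1]                     # eigenvector with the largest eigenvalue
lam1 = eigvals[-1]

# S u1 = lambda_1 u1  (the eigenvalue equation from the Lagrangian)
print(np.allclose(S @ u1, lam1 * u1))   # True

# Var(u1^T X) = lambda_1
print(np.var(u1 @ X), lam1)             # essentially equal
```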

Important Points

  1. Because we perform an eigenvalue decomposition of the covariance matrix, and the covariance matrix is by definition computed from mean-centered data, we have to mean-center the data first.

  2. When the number of features \(d\) is much larger than the number of data points \(n\), we can use the following trick: instead of the \(d \times d\) covariance matrix \(\frac{1}{n}XX^{T}\), compute the much smaller \(n \times n\) matrix \(\frac{1}{n}X^{T}X\). If \(v\) is an eigenvector of \(X^{T}X\) with eigenvalue \(\lambda\), then \(Xv\) is an eigenvector of \(XX^{T}\) with the same eigenvalue, so the principal components can be recovered from the small problem (see the sketch after this list).

  3. Once we have the eigenvectors, every data point can be projected onto the new basis and only the coefficients with respect to the retained principal components need to be stored. This is how we reduce the dimensionality of the data.

  4. PCA can be used to remove noise, since the components with low variance can be discarded.

  5. The directions with the most variance are not necessarily the most meaningful ones. For instance, if we are classifying dog breeds and we drop the low-variance components, we might throw away exactly the features that are important for the classification.

  6. PCA is a linear, orthogonal transformation, hence it can only capture linear relationships in the data.
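
Here is the sketch promised in point 2 for the case \(d \gg n\) (illustrative code only; the key identity is that if \(X^{T}Xv = \lambda v\) then \(XX^{T}(Xv) = \lambda(Xv)\)):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10000, 50                        # many more features than data points
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)   # mean-center each feature

# Small n x n matrix instead of the huge d x d covariance matrix.
G = (X.T @ X) / n
lam, V = np.linalg.eigh(G)              # eigenvalues in ascending order

# Map the top eigenvector of G back to a d-dimensional principal component:
# if G v = lambda v, then S (X v) = lambda (X v) for S = (1/n) X X^T.
v1 = V[:, -1]
u1 = X @ v1
u1 = u1 / np.linalg.norm(u1)            # normalize to unit length

# Check the eigenvalue equation S u1 = lambda_1 u1 without ever forming S.
S_u1 = (X @ (X.T @ u1)) / n
print(np.allclose(S_u1, lam[-1] * u1))  # True
```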
