In this post, we introduce the mathematical foundations behind principal component analysis (PCA), a simple technique for dimensionality reduction.
References: yang's post (written in Chinese); Diagonalizable matrix.
The high-level idea of principal component analysis is to reduce the dimensionality of a given dataset by transforming it into a new dataset in which all variables (features) are uncorrelated with each other. We denote this transformation as follows:
$$ Y = XP \tag{0} $$
where $X \in R^{m \times n}$ is the original dataset with $m$ instances and $n$ variables, $P \in R^{n \times r}$ is the reduction matrix, and $Y \in R^{m \times r}$ is the transformed dataset with $r$ variables. To better understand how information is stored in the dataset matrix $X$, we can look at the variance and covariance of its feature variables. Let's denote $v_a$ as a variable in the variable vector $v \in R^n$. The variance of $v_a$ is:
$$ \mathrm{Var}(v_a) = \frac{1}{m} \sum_{i=1}^{m} (v_a^i - E[v_a])^2 \tag{1} $$
The magnitude of the variance indicates the amount of information embedded in the variable. A variable that contains no information is a constant with zero variance. Covariance, as another measure, indicates the level of dependency between two variables:
$$ \mathrm{Cov}(v_a, v_b) = \frac{1}{m} \sum_{i=1}^{m} (v_a^i - E[v_a])(v_b^i - E[v_b]) \tag{2} $$
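To make equations (1) and (2) concrete, here is a minimal NumPy sketch on a made-up pair of variables. Note that this post uses the biased $\frac{1}{m}$ normalization, so `np.cov` needs `bias=True` to match:

```python
import numpy as np

# A minimal sketch of equations (1) and (2), assuming a toy setup with
# m = 5 instances and the biased (1/m) normalization used in this post.
rng = np.random.default_rng(0)
m = 5
v_a = rng.normal(size=m)   # one feature variable, observed m times
v_b = rng.normal(size=m)   # another feature variable

var_a = np.mean((v_a - v_a.mean()) ** 2)                   # equation (1)
cov_ab = np.mean((v_a - v_a.mean()) * (v_b - v_b.mean()))  # equation (2)

# np.cov defaults to the unbiased 1/(m-1) estimator, so bias=True
# is needed to reproduce the 1/m convention above.
assert np.isclose(var_a, np.cov(v_a, bias=True))
assert np.isclose(cov_ab, np.cov(v_a, v_b, bias=True)[0, 1])
```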
To represent both variance and covariance in matrix form, we can use the covariance matrix. Let's denote the covariance matrix of the random variable vector $v$ as $C$:
\begin{align} C &= \begin{bmatrix} \mathrm{Var}(v_1) & \mathrm{Cov}(v_1, v_2) & \dots & \mathrm{Cov}(v_1, v_n) \\ \mathrm{Cov}(v_2, v_1) & \mathrm{Var}(v_2) & \dots & \mathrm{Cov}(v_2, v_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(v_n, v_1) & \mathrm{Cov}(v_n, v_2) & \dots & \mathrm{Var}(v_n) \end{bmatrix} \tag{3} \\ \end{align}
The diagonal elements of $C$ are the variances of the individual variables, and the off-diagonal elements are the covariances between pairs of variables. Note that if we center $X$ so that every $E[v_i] = 0$, the centered $X$ still contains the same information; the covariance matrix then simplifies as follows:
\begin{align} C &= \begin{bmatrix} \frac{1}{m} \sum_{i=1}^{m} (v_1^i)^2 & \frac{1}{m} \sum_{i=1}^{m} v_1^i v_2^i & \dots & \frac{1}{m} \sum_{i=1}^{m} v_1^i v_n^i \\ \frac{1}{m} \sum_{i=1}^{m} v_2^i v_1^i & \frac{1}{m} \sum_{i=1}^{m} (v_2^i)^2 & \dots & \frac{1}{m} \sum_{i=1}^{m} v_2^i v_n^i \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{m} \sum_{i=1}^{m} v_n^i v_1^i & \frac{1}{m} \sum_{i=1}^{m} v_n^i v_2^i & \dots & \frac{1}{m} \sum_{i=1}^{m} (v_n^i)^2 \end{bmatrix} \\ &= \frac{1}{m} X^T X \tag{4} \end{align}
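As a quick illustration of equation (4), the sketch below centers the columns of a toy random dataset and checks that $\frac{1}{m} X^T X$ matches the element-wise definition in (3):

```python
import numpy as np

# A minimal sketch of equation (4) on a made-up dataset: after centering
# each column of X (so that E[v_i] = 0), the covariance matrix is (1/m) X^T X.
rng = np.random.default_rng(0)
m, n = 100, 4
X = rng.normal(size=(m, n))

X_centered = X - X.mean(axis=0)       # make every column zero-mean
C = X_centered.T @ X_centered / m     # equation (4)

# Same matrix via the per-entry definition in equation (3).
assert np.allclose(C, np.cov(X_centered, rowvar=False, bias=True))
```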
From (3), we know that $C$ is the covariance matrix associated with the original dataset $X$. For the transformed dataset $Y = XP$, we can obtain its covariance matrix $D \in R^{r \times r}$ using the same procedure as in (4). Interestingly, $D$ can be rewritten in terms of only $P$ and $C$:
\begin{align} D &= \frac{1}{m} Y^T Y \\ &= \frac{1}{m} (XP)^T (XP) \\ &= \frac{1}{m} P^T X^T(XP) \\ &= P^T \frac{1}{m} X^T X P \\ &= P^T C P \tag{5} \end{align}
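Note that the identity in (5) holds for any projection matrix $P$, not only the PCA solution. A quick NumPy check with a random (purely illustrative) $P$:

```python
import numpy as np

# A minimal sketch of equation (5): for any n x r matrix P, the covariance
# matrix of Y = XP equals P^T C P. P here is random, not yet the PCA solution.
rng = np.random.default_rng(0)
m, n, r = 100, 4, 2
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)        # centered, as assumed by equation (4)

C = X.T @ X / m
P = rng.normal(size=(n, r))
Y = X @ P

D_direct = Y.T @ Y / m        # covariance of the transformed data
D_via_C = P.T @ C @ P         # equation (5)
assert np.allclose(D_direct, D_via_C)
```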
Since $D$ is the covariance matrix of the new feature variable vector, our objective is to find a $P$ such that any two of the new variables have zero covariance. In addition, because $D$ has a smaller dimension than $C$, we also want the new variables to retain as much information as possible from the original dataset. Given $r$ real values such that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r$, we want $D$ to be a diagonal covariance matrix:
\begin{align} D &= \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_r \end{bmatrix} \tag{6} \\ \end{align}
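One common way to choose $r$, in line with the goal of retaining as much information as possible, is to look at how much of the total variance the $r$ largest eigenvalues account for. A small sketch on a made-up dataset:

```python
import numpy as np

# A minimal sketch of how the sorted eigenvalues in equation (6) quantify
# retained information: the fraction of total variance kept by the top r
# components is the sum of the r largest eigenvalues of C over the sum of all.
rng = np.random.default_rng(0)
m, n, r = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated features
X = X - X.mean(axis=0)

C = X.T @ X / m
eigvals = np.linalg.eigvalsh(C)[::-1]    # eigenvalues of C, largest first

retained = eigvals[:r].sum() / eigvals.sum()
print(f"top {r} components retain {retained:.1%} of the total variance")
```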
This process of finding a $P$ that turns $D$ into a diagonal matrix is called diagonalization in linear algebra (see Diagonalizable matrix). It involves the eigendecomposition of $C$, which happens to be a real symmetric matrix in this case. Applying the eigendecomposition to $C$, we get:
\begin{align} C &= P D P^T \tag{7} \end{align}
where $P$ is an orthogonal matrix whose columns are the eigenvectors of $C$, and the diagonal of $D$ holds the corresponding eigenvalues. For dimensionality reduction, we keep only the $r$ columns of $P$ that correspond to the $r$ largest eigenvalues.
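Putting the pieces together, here is a minimal sketch of PCA via eigendecomposition on a toy dataset: the columns of $P$ are the eigenvectors of $C$ sorted by decreasing eigenvalue, and $D = P^T C P$ comes out diagonal as in (6):

```python
import numpy as np

# A minimal sketch of PCA via the eigendecomposition in equation (7),
# on a made-up dataset with correlated features.
rng = np.random.default_rng(0)
m, n, r = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

C = X.T @ X / m
eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalues for symmetric C
order = np.argsort(eigvals)[::-1]      # re-order so the largest comes first
P = eigvecs[:, order[:r]]              # n x r reduction matrix

Y = X @ P                              # the transformed dataset, equation (0)
D = P.T @ C @ P                        # equation (5)

# D is diagonal: the new features are uncorrelated, and their variances
# are the r largest eigenvalues of C, as in equation (6).
assert np.allclose(D, np.diag(eigvals[order[:r]]))
```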
It's also common to solve the PCA problem with singular value decomposition (SVD). First, let's apply SVD to $X$:
\begin{align} X &= U \Sigma V^T \tag{8} \end{align}
For $X^T X$, we have:
\begin{align} X^T X &= (U \Sigma V^T)^T U \Sigma V^T \\ &= V \Sigma^T U^T U \Sigma V^T \\ &= V \Sigma^T \Sigma V^T \tag{$U^T U = I$} \\ &= V \Sigma^2 V^T \tag{9} \end{align}
where $\Sigma^T \Sigma$ is a diagonal matrix whose entries are the squared singular values of $X$, which we abbreviate as $\Sigma^2$.
Given (7) and (9), it's not difficult to see that $P = V$ and $m D = \Sigma^2$, where $m$ is the number of instances. This means we can solve PCA by applying SVD to $X$ alone, without ever computing $X^T X$.
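As a sanity check, the sketch below (again on a toy dataset) runs both routes and confirms that the right singular vectors and $\frac{1}{m}\Sigma^2$ match the eigenvectors and eigenvalues of $C$, up to the usual sign ambiguity of eigenvectors:

```python
import numpy as np

# A minimal sketch of the SVD route in equations (8)-(9): the right singular
# vectors play the role of P, and the squared singular values divided by m
# recover the eigenvalues of C, so X^T X never has to be formed explicitly.
rng = np.random.default_rng(0)
m, n, r = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

# SVD route: X = U Sigma V^T, singular values in descending order.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
P_svd = Vt[:r].T                   # top-r right singular vectors, n x r
eigvals_svd = sigma[:r] ** 2 / m   # m D = Sigma^2  =>  lambda_i = sigma_i^2 / m

# Eigendecomposition route, for comparison.
C = X.T @ X / m
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals_eig = eigvals[order[:r]]
P_eig = eigvecs[:, order[:r]]

assert np.allclose(eigvals_svd, eigvals_eig)
# Eigenvectors are defined only up to a sign, so compare column by column.
for k in range(r):
    assert np.allclose(P_svd[:, k], P_eig[:, k]) or np.allclose(P_svd[:, k], -P_eig[:, k])
```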