In this post we introduce a common view on Expectation Maximization using Jensen's inequality.
The goal of EM is to maximize the likelihood of the observed variable $X$, which is explained by a hidden variable $Z$, under a model with parameters $\theta$. Note that when $Z$ is known (usually as labels), this becomes a supervised learning problem and can be solved directly with maximum likelihood estimation instead of EM. For a training dataset of size $n$ with instances $x_1, x_2, \dots, x_n$ drawn from the distribution $p_{true}(x)$, the expected log-likelihood is approximated by the sample average:
\begin{align} \mathbb{E}_x (\log p_{\theta}(x)) \approx \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i), \quad x_i \sim p_{true}(x) \tag{0} \end{align}
Our objective is to maximize (0) by adjusting $\theta$:
\begin{align} \operatorname*{argmax}_{\theta} \mathbb{E}_x (\log p_{\theta}(x)) &= \operatorname*{argmax}_{\theta} n \cdot \mathbb{E}_x (\log p_{\theta}(x)) \\ &\approx \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x_i) \\ &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \log \sum_{z_i} p_{\theta}(x_i , z_i) \\ &= \operatorname*{argmax}_{\theta} l(\theta) \tag{1} \end{align}
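As a concrete, purely illustrative example (not part of the original derivation), the short Python sketch below evaluates $l(\theta)$ for a hypothetical two-component 1-D Gaussian mixture in which the hidden variable $z_i \in \{0, 1\}$ indexes the component; the data-generating parameters and the guessed $\theta$ are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Draw x_i ~ p_true(x) from a hypothetical two-component 1-D Gaussian mixture
# (the component assignment plays the role of the hidden variable z_i).
x = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(5.0, 1.0, 300)])

def l_theta(x, pi, mu, sigma):
    """l(theta) = sum_i log sum_{z_i} p_theta(x_i, z_i) for a 2-component mixture."""
    joint = pi * norm.pdf(x[:, None], mu, sigma)   # p_theta(x_i, z_i), shape (n, 2)
    return np.log(joint.sum(axis=1)).sum()         # marginalize z_i, then sum the logs

# Evaluate the objective at some guessed parameters theta = (pi, mu, sigma).
print(l_theta(x, pi=np.array([0.5, 0.5]), mu=np.array([-1.0, 4.0]), sigma=np.array([1.0, 1.0])))
```

The same toy mixture is reused in the sketches further below.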
Starting from (1), the derivation based on Jensen's inequality introduces a "helper" distribution $q$ over $z_i$ such that
\begin{align}
l(\theta) &= \sum_{i=1}^{n} \log \sum_{z_i} p_{\theta}(x_i , z_i) \\
&= \sum_{i=1}^{n} \log \sum_{z_i} q(z_i) \frac{p_{\theta}(x_i , z_i)}{q(z_i)} \tag{2} \\
& \geq \sum_{i=1}^{n} \sum_{z_i} q(z_i) \log \frac{p_{\theta}(x_i , z_i)}{q(z_i)} \tag{3}
\end{align}
(2) is greater than or equal to (3) because Jensen's inequality states that $f(\mathbb{E}[Y]) \geq \mathbb{E}[f(Y)]$ if $f$ is a concave function (for more details, see Andrew's notes or Wikipedia). In our case, $f$ is $\log$, which is concave, and $Y$ is $\frac{p_{\theta}(x_i , z_i)}{q(z_i)}$ with $z_i$ drawn from $q$. Using (3) as a lower bound of (2), we want to raise (3) until it meets (2), i.e. (3) $=$ (2), so that maximizing the bound maximizes the likelihood. For this, Jensen's inequality also states that $f(\mathbb{E}[Y]) = \mathbb{E}[f(Y)]$ when $Y$ is constant. Therefore, we can construct $q$ such that $Y$ equals some constant $c$ as follows
$$ Y = \frac{p_{\theta}(x_i , z_i)}{q(z_i)} = c \tag{4} $$
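As a quick numerical sanity check of this step (again my own illustration, not from the original post): for a positive random variable $Y$, $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$, and the two sides coincide once $Y$ is constant, which is exactly the condition (4) asks for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen's inequality for the concave function log: log(E[Y]) >= E[log Y].
y = rng.uniform(0.5, 2.0, size=100_000)
print(np.log(y.mean()), np.mean(np.log(y)))              # left value is strictly larger

# Equality holds when Y is constant, as required in (4).
y_const = np.full(100_000, 1.7)
print(np.log(y_const.mean()), np.mean(np.log(y_const)))  # the two values coincide
```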
If we assume that, for each instance $x_i$, (4) holds with the same constant $c$ for every value of $z_i$, then summing the numerator and the denominator over $z_i$ gives
$$ \frac{\sum_{z_i} p_{\theta}(x_i , z_i)}{\sum_{z_i} q(z_i)} = c \tag{5} $$
Given $q(z_i)$ is a probability distribution and $\sum_{z_i} q(z_i) = 1$, we have
$$ c = \sum_{z_i} p_{\theta}(x_i , z_i) = p_{\theta}(x_i) \tag{6} $$
Therefore, $q(z_i)$ can be written as follows
$$ q(z_i) = \frac{p_{\theta}(x_i , z_i)}{p_{\theta}(x_i)} = p_{\theta}(z_i | x_i) \tag{7} $$
(7) is our E-step:
$$ q(z_i) = p_{\theta}(z_i | x_i) \tag{E-step} $$
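To see the E-step in code, here is a sketch for the same hypothetical two-component mixture as above: $q(z_i) = p_{\theta}(z_i \mid x_i)$ is obtained from Bayes' rule as the joint divided by the marginal (equations (6) and (7)), and plugging this $q$ into the lower bound (3) recovers $l(\theta)$ exactly, i.e. the bound is tight.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(5.0, 1.0, 300)])

# Current (guessed) parameters theta for the hypothetical 2-component mixture.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 4.0]), np.array([1.0, 1.0])

# Joint p_theta(x_i, z_i) for z_i in {0, 1}; shape (n, 2).
joint = pi * norm.pdf(x[:, None], mu, sigma)

# E-step: q(z_i) = p_theta(z_i | x_i) = p_theta(x_i, z_i) / p_theta(x_i)   (equation 7)
q = joint / joint.sum(axis=1, keepdims=True)

# With this q, the lower bound (3) equals the log-likelihood l(theta): the bound is tight.
lower_bound = np.sum(q * np.log(joint / q))
log_lik = np.sum(np.log(joint.sum(axis=1)))
print(lower_bound, log_lik)   # the two numbers agree up to floating-point error
```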
Using (7) to replace $q(z_i)$ in (3), we obtain our M-step. The posterior $p_{\theta_{old}}(z_i \vert x_i)$ is evaluated at the current parameters $\theta_{old}$ during the E-step and held constant while we maximize over $\theta$, so the second (entropy) term below is a constant $C$ with respect to $\theta$:
\begin{align} \theta_{new} &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \sum_{z_i} p_{\theta_{old}}(z_i | x_i) \log \frac{p_{\theta}(x_i , z_i)}{p_{\theta_{old}}(z_i | x_i)} \\ &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \left( \sum_{z_i} p_{\theta_{old}}(z_i | x_i) \log p_{\theta}(x_i , z_i) - \sum_{z_i} p_{\theta_{old}}(z_i | x_i) \log p_{\theta_{old}}(z_i | x_i) \right) \\ &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \sum_{z_i} p_{\theta_{old}}(z_i | x_i) \log p_{\theta}(x_i , z_i) - C \\ &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \sum_{z_i} p_{\theta_{old}}(z_i | x_i) \log p_{\theta}(x_i , z_i) \tag{M-step} \end{align}
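Putting both steps together, the sketch below runs EM on the same hypothetical two-component 1-D Gaussian mixture. Note that the closed-form M-step updates are specific to this Gaussian-mixture example (they are the maximizers of the M-step objective for this particular model), not part of the general derivation above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(5.0, 1.0, 300)])

# Initial guess for theta = (mixing weights pi, means mu, standard deviations sigma).
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 4.0]), np.array([1.0, 1.0])

for it in range(50):
    # E-step: responsibilities q(z_i) = p_theta_old(z_i | x_i)   (equation 7)
    joint = pi * norm.pdf(x[:, None], mu, sigma)   # p_theta_old(x_i, z_i), shape (n, 2)
    q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i sum_z q(z_i) log p_theta(x_i, z_i) over theta,
    # holding q fixed; for a Gaussian mixture this has a closed-form solution.
    nk = q.sum(axis=0)                             # effective count per component
    pi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # should move toward the mixture that generated x
```

Each iteration first computes the responsibilities with the current parameters (E-step) and then re-fits the weights, means, and standard deviations against those fixed responsibilities (M-step), which is exactly the $\theta_{old}$-then-$\theta$ split discussed above.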
Regarding (5): this assumption seems necessary, yet it is glossed over in Andrew's notes and some other sources. How does this assumption affect the construction of $q(z_i)$?