11. VAE
1. Introduction
Encoder & Decoder
Efficient representation of input.
Encoder: Input $\rightarrow$ Summarizer ($W_{enc}$) $\rightarrow$ $h$.
Decoder: $h$ $\rightarrow$ Generator ($W_{dec}$) $\rightarrow$ Output $y$.
Autoencoder (AE)
$\hookrightarrow$ Reconstruct input $x$ as output $\tilde{x}$ (learns code).
Enc: $f_{\phi}(x) = h$, $q_{enc}(h|x)$. Dec: $H_{\theta}(h) = \tilde{x}$, $p_{dec}(\tilde{x}|h)$.
$\rightarrow$ But we should regularize the AE (enc/dec with too much capacity $\rightarrow$ just copies the input).
(1) Robustness to noise.
- (2) Sparse representation.
Denoising Autoencoder
$x \xrightarrow{noise} \tilde{x} \rightarrow \dots \rightarrow \hat{x}$.
Variational Inference:
Variational?
(1) Similar input $\rightarrow$ Similar representation.
Gaussian noise, Dropout.
$\hookrightarrow$ Express as an optimization problem.
2. Variational Inference
Variational Inference
Restrict $Q$.
$\phi^* = \arg\min_{\phi} D_{KL}[q(z) \| p(z|x)]$ $\hookrightarrow$ This needs to be chosen well.
$\hookrightarrow$ Tractable distribution (e.g., Gaussian).
Allow sufficient $Q$.
- Good approx. to true posterior.
Common Restrictions on $Q$
(1) Factorization:
Assume $Q$ factorises.
$Q(Z) = \prod_{j=1}^{m} Q_j(Z_j)$
$Z_j$: latent variable.
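The mean-field factorization can be illustrated with a diagonal Gaussian: the joint log-density is just the sum of the per-coordinate log-densities (a minimal numpy sketch; the $\mu_j, \sigma_j$ values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                    # number of latent factors Z_j
mu = rng.normal(size=m)                  # per-factor means
sigma = rng.uniform(0.5, 2.0, size=m)    # per-factor std devs
z = rng.normal(size=m)                   # an evaluation point

def log_gauss(z, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((z - mu) / sigma) ** 2

# Mean-field: log Q(Z) = sum_j log Q_j(Z_j)
log_q_factorized = np.sum(log_gauss(z, mu, sigma))

# Same quantity written as one diagonal multivariate Gaussian
cov = np.diag(sigma**2)
diff = z - mu
log_q_joint = (-0.5 * m * np.log(2 * np.pi)
               - 0.5 * np.log(np.linalg.det(cov))
               - 0.5 * diff @ np.linalg.inv(cov) @ diff)

assert np.isclose(log_q_factorized, log_q_joint)
```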
(2) Parameterization:
Sigmoid, Softmax.
$D_{KL}[q(z;\phi) \| p(z|x)]$ $\rightarrow$ $p(z|x)$ is the target; fit $q(z;\phi) \rightarrow \phi^*$.
- Inference = Finding $\phi^*$.
Challenge
(1) Not computable.
$\phi^* = \arg\min D_{KL}[q(z) \| p(z|x)]$
$D_{KL}[q(z) \| p(z|x)] = E_{z \sim q}[\log q(z)] - E_{z \sim q}[\log p(z|x)]$
$= E_z[\log q(z)] - (E_z[\log p(z,x)] - \log p(x))$
$= E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$
- $\hookrightarrow$ Intractable (cannot compute $\log p(x)$).
3. Evidence Lower Bound (ELBO)
Derivation
$\log p(x) \ge ELBO$ (Evidence Lower Bound).
Note: $D_{KL}[q(z) \| p(z|x)] = E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$
$\log p(x) - D_{KL}[q(z) \| p(z|x)] = E_z[\log p(z,x)] - E_z[\log q(z)]$
Left side: Constant ($\log p(x)$) $-$ KL (always $\ge 0$).
Right side: ELBO.
- $\therefore$ Reducing $D_{KL}$ = Maximizing the ELBO w.r.t. $q(\phi)$.
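The identity $\log p(x) = \mathcal{L}(q) + D_{KL}$ can be verified numerically on a toy model with a discrete latent, where $\log p(x)$ is exactly computable (a minimal numpy sketch; the prior, likelihood, and $q$ values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5                                      # discrete latent z with K states

p_z = rng.dirichlet(np.ones(K))            # prior p(z)
p_x_given_z = rng.uniform(0.05, 0.95, K)   # likelihood p(x|z) for one fixed x
q_z = rng.dirichlet(np.ones(K))            # arbitrary variational q(z)

p_zx = p_z * p_x_given_z                   # joint p(z, x)
log_px = np.log(p_zx.sum())                # evidence log p(x)
p_z_given_x = p_zx / p_zx.sum()            # true posterior p(z|x)

# ELBO = E_q[log p(z,x)] - E_q[log q(z)]
elbo = np.sum(q_z * (np.log(p_zx) - np.log(q_z)))
# KL = E_q[log q(z) - log p(z|x)]
kl = np.sum(q_z * (np.log(q_z) - np.log(p_z_given_x)))

# log p(x) = ELBO + KL, so minimizing KL == maximizing ELBO
assert np.isclose(elbo + kl, log_px)
assert kl >= 0
```

Since $\log p(x)$ is fixed by the data, any increase in the ELBO must come out of the KL term.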
ELBO Formulation
$\mathcal{L}(q) = E_z[\log p(z,x)] - E_z[\log q(z)]$
Recall that $p(z,x) = p(x|z)p(z)$.
$\mathcal{L}(q) = E_z[\log p(x|z) + \log p(z)] - E_z[\log q(z)]$
$= E_z[\log p(x|z)] - D_{KL}[q(z) \| p(z)]$
Term 1: Expected likelihood ($E_{recon}$).
- Keep the reconstruction error low.
Term 2: KL between prior $p(z)$ and $q(z)$.
Regularization ($P$ and $Q$ should not be too far apart).
$Density \leftrightarrow Prior$.
Why ELBO?
For any $q$, $\log p(x) \ge \mathcal{L}(q)$.
$Proof$: $\mathcal{L}(q) + D_{KL}[q(z) \| p(z|x)] = \log p(x)$. Since $D_{KL} \ge 0$, $\log p(x) \ge \mathcal{L}(q)$.
Alternative Proof (Jensen’s Inequality):
- $\log E[f(x)] \ge E[\log f(x)]$.
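A quick numeric check of Jensen's inequality for the concave $\log$, which in the ELBO derivation is applied to $\log p(x) = \log E_q[p(x,z)/q(z)]$ (a minimal numpy sketch; the sampled values of $f(x)$ are arbitrary positives):

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.uniform(0.1, 5.0, size=10_000)  # positive values of f(x) under some distribution

lhs = np.log(np.mean(f))   # log E[f(x)]
rhs = np.mean(np.log(f))   # E[log f(x)]
assert lhs >= rhs          # Jensen: log is concave, so log of the mean dominates
```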
4. Variational Autoencoder (VAE)
Concept
Basically, likelihood-based autoencoders.
$X \xrightarrow[enc]{q(z|x)} Z \xrightarrow[dec]{p(x|z)} X$
$\log p(x) \ge E_z[\log p(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$
Structure
(1) **Encoder ($q_\phi(z|x)$)**: Parameterized.
(2) **Decoder ($p_\theta(x|z)$)**: Data generator.
- (3) Prior ($p(z)$): $z \sim \mathcal{N}(0, I)$.
Key Idea
Training: Enc $q_\phi(z|x)$ $\rightarrow$ $\mu, \sigma$ $\rightarrow$ sample $z$. Dec $p_\theta(x|z)$ $\rightarrow$ $\tilde{x}$.
Difference: The output of the encoder is not $z$ itself, but the parameters of the distribution $q(z|x)$ from which we sample.
Comparison
AE: $x \rightarrow z \rightarrow \tilde{x}$. (Feature learner).
VAE: $x \rightarrow q_\phi(z|x) \rightarrow z \rightarrow p_\theta(x|z) \rightarrow \tilde{x}$. (Data generation). $p(x|z)$: the main actor.
Formulation & Architecture
(1) Autoregressive: $p(x) = \prod p(x_i|x_{<i})$.
(2) Latents: $p(x) = \int p(x,z)\,dz = \int p(x|z)p(z)\,dz$. Prior $p(z)$ is simple ($\mathcal{N}(0, I)$).
$p(x|z)$ is modeled by the Decoder. The output of $p_\theta(x|z)$ depends on the distribution type (Bernoulli $\rightarrow$ Sigmoid, Categorical $\rightarrow$ Softmax).
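For the Bernoulli case, the decoder's raw outputs are squashed by a sigmoid into per-pixel probabilities, and $\log p_\theta(x|z)$ is the Bernoulli log-likelihood (a minimal numpy sketch; the binary $x$ and the decoder logits are random stand-ins, not a trained network):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_log_px_given_z(x, logits):
    """log p(x|z) when the decoder head outputs Bernoulli parameters via a sigmoid."""
    p = sigmoid(logits)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(3)
x = (rng.uniform(size=8) > 0.5).astype(float)  # binary "pixels"
logits = rng.normal(size=8)                    # raw decoder outputs for one z
ll = bernoulli_log_px_given_z(x, logits)
assert ll < 0  # each Bernoulli factor has probability < 1, so the log-likelihood is negative
```

For a Categorical output one would instead apply a softmax over classes and sum the log-probabilities of the observed categories.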
5. Training VAE
Training
Deterministic reconstruction: $\hat{x} = E_{p(x|z)}[x]$. Inference: Stochastic reconstruction $\hat{x} \sim p(x|z)$.
Loss Function
(1) KL Divergence (Regularizer):
$D_{KL}(q_\phi(z|x) \| p(z))$. KL between two Gaussians (closed form).
- $\mathcal{L}_{KL} = \frac{1}{2} \sum_{k=1}^{K} (\sigma_k^2(x) + \mu_k^2(x) - 1 - \ln \sigma_k^2(x))$.
(2) Reconstruction Loss:
$\mathcal{L}_{recon} = -E_{z \sim q}[\log p_\theta(x|z)]$.
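The closed-form KL term above can be sanity-checked against a Monte Carlo estimate of $E_q[\log q(z) - \log p(z)]$ (a minimal numpy sketch; $\mu(x)$ and $\sigma(x)$ are random stand-ins for encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3
mu = rng.normal(size=K)                 # encoder mean mu(x), made up here
sigma = rng.uniform(0.5, 1.5, size=K)   # encoder std sigma(x), made up here

# Closed form: KL( N(mu, diag(sigma^2)) || N(0, I) )
kl_closed = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Monte Carlo check: KL = E_{z~q}[log q(z) - log p(z)]
z = mu + sigma * rng.normal(size=(200_000, K))
log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((z - mu) / sigma) ** 2, axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

assert abs(kl_closed - kl_mc) < 0.05
```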
Reparameterization Trick
Problem: $z \sim q_\phi(z|x)$. How to backprop through sampling? Solution: Move the randomness to an input layer.
$z^{(l)} \sim \mathcal{N}(\mu(x), \Sigma(x))$.
Trick: $z^{(l)} = \mu(x) + \sigma(x) \odot \epsilon^{(l)}$ where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$.
- Now differentiable w.r.t. $\mu$ and $\sigma$.
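The trick can be checked numerically: $z = \mu + \sigma \odot \epsilon$ has exactly the distribution $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, and because $z$ is now a deterministic function of $\mu, \sigma$, gradients can be pushed through the sample (a minimal numpy sketch; the $\mu, \sigma$ values are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 1.5])

eps = rng.normal(size=(100_000, 2))  # randomness moved to an input: eps ~ N(0, I)
z = mu + sigma * eps                 # deterministic, differentiable transform of eps

# z has the intended distribution N(mu, diag(sigma^2)) ...
assert np.allclose(z.mean(axis=0), mu, atol=0.02)
assert np.allclose(z.std(axis=0), sigma, atol=0.02)

# ... and gradients flow: for f(z) = z^2, d/dmu E[f(z)] = E[2z * dz/dmu] = E[2z] = 2*mu
grad_mu = np.mean(2 * z, axis=0)     # pathwise (reparameterized) gradient estimate
assert np.allclose(grad_mu, 2 * mu, atol=0.05)
```

In a framework with autodiff (e.g. PyTorch), `eps` is the only sampled tensor, so backprop reaches $\mu$ and $\sigma$ directly.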