
11. VAE

1. Introduction

  • Encoder & Decoder

    • Efficient representation of input.

    • Encoder: Input $\rightarrow$ Summarizer ($W_{enc}$) $\rightarrow$ $h$.

    • Decoder: $h$ $\rightarrow$ Generator ($W_{dec}$) $\rightarrow$ Output $y$.

  • Autoencoder (AE)

    • $\hookrightarrow$ Reconstruct input $x$ as output $\tilde{x}$ (learns code).

    • Enc: $f_{\phi}(x) = h$, $q_{enc}(h|x)$
    • Dec: $H_{\theta}(h) = \tilde{x}$, $p_{dec}(\tilde{x}|h)$
    • $\rightarrow$ But we should regularize the AE (an encoder/decoder with too much capacity simply copies the input).

    • (1) Robustness to noise.

    • (2) Sparse representation.
  • Denoising Autoencoder

    • $x \xrightarrow{noise} \tilde{x} \rightarrow \dots \rightarrow \hat{x}$.

    • Variational Inference:

      • Variational?

      • (1) Similar input $\rightarrow$ Similar representation.

      • Gaussian noise, Dropout.

    • $\hookrightarrow$ Express as an optimization problem.
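As a concrete anchor for the encoder/decoder and denoising ideas above, here is a minimal denoising-autoencoder sketch in PyTorch. The input size (784), code size (32), layer widths, and noise level are illustrative assumptions, not values from these notes.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Encoder f_phi: x -> h, decoder: h -> x_hat, trained to undo input corruption."""
    def __init__(self, x_dim=784, h_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, h_dim))
        self.dec = nn.Sequential(nn.Linear(h_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        x_noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input (Gaussian noise)
        h = self.enc(x_noisy)                    # code h
        return self.dec(h)                       # reconstruction x_hat

model = DenoisingAE()
x = torch.rand(16, 784)                          # toy batch
loss = nn.functional.mse_loss(model(x), x)       # reconstruct the *clean* x
loss.backward()
```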


2. Variational Inference

  • Variational Inference

    • Restrict $Q$.

    • $\phi^* = \arg\min_{\phi} D_{KL}[q(z) \| p(z|x)]$
    • $\hookrightarrow$ Need to choose this ($Q$) well.

    • $\hookrightarrow$ Tractable distribution (e.g., Gaussian).

    • Allow a sufficiently flexible $Q$.

    • Good approx. to true posterior.
  • Common Restrictions on $Q$

    • (1) Factorization:

      • Assume $Q$ factorises.

      • $Q(Z) = \prod_{j=1}^{m} Q_j(Z_j)$

      • $Z_j$: latent variable.

    • (2) Parameterization:

      • Sigmoid, Softmax.

      • $D_{KL}[q(z;\phi) \| p(z|x)]$
      • $\rightarrow$ $P(Z|x)$ is the target.
      • $q(z;\phi) \rightarrow \phi^*$.

      • Inference = Finding $\phi^*$.
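As a toy illustration of "inference = finding $\phi^*$", the sketch below parameterizes $q(z;\phi)$ as a factorized Gaussian and fits $\phi = (\mu, \log\sigma)$ by gradient descent on the KL to a known Gaussian target standing in for $p(z|x)$; the target's numbers are made up for the example.

```python
import torch
from torch.distributions import Normal, kl_divergence

# A known Gaussian playing the role of the target posterior p(z|x) (illustrative only).
p = Normal(torch.tensor([1.0, -2.0]), torch.tensor([0.5, 1.5]))

# Variational parameters phi = (mu, log_sigma) of a factorized Gaussian q(z; phi).
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(500):
    q = Normal(mu, log_sigma.exp())
    loss = kl_divergence(q, p).sum()  # D_KL[q(z; phi) || p(z|x)], closed form for Gaussians
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.detach(), log_sigma.exp().detach())  # converges toward the target's mean and scale
```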
  • Challenge

    • (1) Not computable.

    • $\phi^* = \arg\min D_{KL}[q(z) \| p(z|x)]$
    • $D_{KL}[q(z) \| p(z|x)] = E_{z \sim q}[\log q(z)] - E_{z \sim q}[\log p(z|x)]$
    • $= E_z[\log q(z)] - (E_z[\log p(z,x)] - \log p(x))$

    • $= E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$

    • $\hookrightarrow$ Intractable (Cannot compute $\log p(x)$).
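A tiny discrete sanity check of the decomposition above (the numbers are made up): the two expectation terms need only $q$ and the joint $p(z,x)$, but recovering the actual KL value still requires $\log p(x)$, which is what makes the objective intractable in general.

```python
import numpy as np

# Toy joint p(z, x) at a fixed observed x, for three latent states z = 0, 1, 2.
p_joint = np.array([0.10, 0.25, 0.05])
p_x = p_joint.sum()                       # in real models this marginal is intractable
p_post = p_joint / p_x                    # true posterior p(z|x)

q = np.array([0.2, 0.5, 0.3])             # some variational q(z)

kl_direct = np.sum(q * (np.log(q) - np.log(p_post)))
decomposed = np.sum(q * np.log(q)) - np.sum(q * np.log(p_joint)) + np.log(p_x)

print(np.isclose(kl_direct, decomposed))  # True: the identity holds
```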

3. Evidence Lower Bound (ELBO)

  • Derivation

    • $\log p(x) \ge ELBO$ (Evidence Lower Bound).

    • Note: $D_{KL}[q(z) \| p(z|x)] = E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$
    • $\log p(x) - D_{KL}[q(z) \| p(z|x)] = E_z[\log p(z,x)] - E_z[\log q(z)]$
    • Left side: Constant ($\log p(x)$) - KL (always $\ge 0$).

    • Right side: ELBO.

    • $\therefore$ Reducing $D_{KL}$ = maximizing the ELBO w.r.t. $q_\phi$.
  • ELBO Formulation

    • $\mathcal{L}(q) = E_z[\log p(z,x)] - E_z[\log q(z)]$

    • Recall that $p(z,x) = p(x|z)p(z)$.
    • $\mathcal{L}(q) = E_z[\log p(x|z) + \log p(z)] - E_z[\log q(z)]$
    • $= E_z[\log p(x|z)] - D_{KL}[q(z) \| p(z)]$
      • Term 1: Expected likelihood ($E_{recon}$).

        • Keep the reconstruction error low.
      • Term 2: KL between prior $p(z)$ and $q(z)$.

        • Regularization (keep $Q$ not too far from $P$).

        • Density $\leftrightarrow$ Prior.

  • Why ELBO?

    • For any $q$, $\log p(x) \ge \mathcal{L}(q)$.

    • Proof: $\mathcal{L}(q) + D_{KL}[q(z) \| p(z|x)] = \log p(x)$.
    • Since $D_{KL} \ge 0$, $\log p(x) \ge \mathcal{L}(q)$.

    • Alternative Proof (Jensen’s Inequality):

      • $\log E[f(x)] \ge E[\log f(x)]$.
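Spelling the Jensen route out in the same notation (multiply and divide by $q(z)$ inside the integral, then use concavity of $\log$):

$$
\log p(x) = \log \int p(x,z)\,dz
          = \log E_{z \sim q}\!\left[\frac{p(x,z)}{q(z)}\right]
          \ge E_{z \sim q}\!\left[\log \frac{p(x,z)}{q(z)}\right]
          = E_z[\log p(x,z)] - E_z[\log q(z)] = \mathcal{L}(q).
$$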

4. Variational Autoencoder (VAE)

  • Concept

    • Basically, likelihood-based autoencoders.

    • $X \xrightarrow[enc]{q(z|x)} Z \xrightarrow[dec]{p(x|z)} X$
    • $\log p(x) \ge E_z[\log p(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$
  • Structure

    • (1) **Encoder ($q_\phi(z|x)$)**: Parameterized.
    • (2) **Decoder ($p_\theta(x|z)$)**: Data generator.
    • (3) Prior ($p(z)$): $z \sim \mathcal{N}(0, I)$.
  • Key Idea

    • Training: Enc $q_\phi(z|x)$ $\rightarrow$ $\mu, \sigma$ $\rightarrow$ sample $z$.
    • Dec $p_\theta(x|z)$ $\rightarrow$ $\tilde{x}$.
    • Difference: the encoder outputs not $z$ itself, but the parameters of the distribution $q(z|x)$ from which we sample $z$.
  • Comparison

    • AE: $x \rightarrow z \rightarrow \tilde{x}$. (Feature learner).

    • VAE: $x \rightarrow q_\phi(z|x) \rightarrow z \rightarrow p_\theta(x|z) \rightarrow \tilde{x}$. (Data generation).
      • $p(x|z)$: the main actor.
  • Formulation & Architecture

    • (1) Autoregressive: $p(x) = \prod_i p(x_i|x_{<i})$.
    • (2) Latents: $p(x) = \int p(x,z)\, dz = \int p(x|z)p(z)\, dz$.
      • Prior $p(z)$ is simple ($\mathcal{N}(0, I)$).

      • $p(x|z)$ is modeled by the Decoder.
      • Output of $p_\theta(x|z)$ depends on the distribution type (Bernoulli $\rightarrow$ Sigmoid, Categorical $\rightarrow$ Softmax).
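A minimal VAE module sketch in PyTorch for the Bernoulli case above; the 784/20/256 dimensions and layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())  # Bernoulli -> sigmoid

    def forward(self, x):
        e = self.enc(x)
        mu, log_var = self.mu(e), self.log_var(e)              # encoder outputs q(z|x) parameters
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterized sample z ~ q(z|x)
        return self.dec(z), mu, log_var                        # decoder mean of p(x|z), plus q params
```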

5. Training VAE

  • Training

    • Deterministic reconstruction: $\hat{x} = E_{p(x|z)}[x]$.
    • Inference: Stochastic reconstruction $\hat{x} \sim p(x|z)$.
  • Loss Function

    • (1) KL Divergence (Regularizer):

      • $D_{KL}(q_\phi(z|x) \| p(z))$.
      • KL between two Gaussians (closed form).

      • $\mathcal{L}_{KL} = \frac{1}{2} \sum_{k=1}^{K} \left(\sigma_k^2(x) + \mu_k^2(x) - 1 - \ln \sigma_k^2(x)\right)$.
    • (2) Reconstruction Loss:

      • $\mathcal{L}_{recon} = -E_{z \sim q}[\log p_\theta(x|z)]$.
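Combining the two terms for the Gaussian-encoder / Bernoulli-decoder case, assuming the outputs of a module like the VAE sketch above (binary cross-entropy plays the role of $-\log p_\theta(x|z)$):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x|z)] for Bernoulli outputs in [0, 1].
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var)
    return recon + kl  # = -ELBO (up to constants), minimized during training
```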
  • Reparameterization Trick

    • Problem: $z \sim q_\phi(z|x)$. How to backprop through sampling?
    • Solution: Move randomness to input layer.

    • $z^{(l)} \sim \mathcal{N}(\mu(x), \Sigma(x))$.

    • Trick: $z^{(l)} = \mu(x) + \sigma(x) \odot \epsilon^{(l)}$ where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$.

    • Now differentiable w.r.t. $\mu$ and $\sigma$.
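A minimal gradient check of the trick (toy tensors only): after reparameterization, the sample $z$ is a deterministic function of $(\mu, \sigma, \epsilon)$, so gradients reach $\mu$ and $\sigma$.

```python
import torch

mu = torch.tensor([0.0, 0.0], requires_grad=True)
sigma = torch.tensor([1.0, 1.0], requires_grad=True)

eps = torch.randn(2)        # randomness moved to an input: eps ~ N(0, I)
z = mu + sigma * eps        # z ~ N(mu, sigma^2), now differentiable w.r.t. mu and sigma

z.sum().backward()
print(mu.grad, sigma.grad)  # dz/dmu = 1, dz/dsigma = eps
```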