
11. VAE

1. Introduction

  • Encoder & Decoder

    • Efficient representation of input.

    • Encoder: Input $\rightarrow$ Summarizer ($W_{enc}$) $\rightarrow$ $h$.

    • Decoder: $h$ $\rightarrow$ Generator ($W_{dec}$) $\rightarrow$ Output $y$.

  • Autoencoder (AE)

    • $\hookrightarrow$ Reconstruct input $x$ as output $\tilde{x}$ (learns code).

    • Enc: $f_{\phi}(x) = h$, $q_{enc}(h|x)$
    • Dec: $H_{\theta}(h) = \tilde{x}$, $p_{dec}(\tilde{x}|h)$
    • $\rightarrow$ But we should regularize the AE (an encoder/decoder with too much capacity simply copies the input).

    • (1) Robustness to noise.

    • (2) Sparse representation.
  • Denoising Autoencoder

    • $x \xrightarrow{noise} \tilde{x} \rightarrow \dots \rightarrow \hat{x}$.

    • Variational Inference:

      • Variational?

      • (1) Similar input $\rightarrow$ Similar representation.

      • Gaussian noise, Dropout.

    • $\hookrightarrow$ Express as an optimization problem.
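As a concrete anchor for the encoder/decoder and denoising ideas above, here is a minimal denoising-autoencoder sketch in PyTorch. The input size (784), code size (32), layer widths, and noise level are illustrative assumptions, not values from these notes.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Encoder f_phi: x -> h, decoder: h -> x_hat, trained to undo input corruption."""
    def __init__(self, x_dim=784, h_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, h_dim))
        self.dec = nn.Sequential(nn.Linear(h_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        x_noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input (Gaussian noise)
        h = self.enc(x_noisy)                    # code h
        return self.dec(h)                       # reconstruction x_hat

model = DenoisingAE()
x = torch.rand(16, 784)                          # toy batch
loss = nn.functional.mse_loss(model(x), x)       # reconstruct the *clean* x
loss.backward()
```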


2. Variational Inference

  • Variational Inference

    • Restrict $Q$.

    • $\phi^* = \arg\min_{\phi} D_{KL}[q(z) \| p(z|x)]$
    • $\hookrightarrow$ Need to choose this ($Q$) well.

    • $\hookrightarrow$ Tractable distribution (e.g., Gaussian).

    • Allow a sufficiently flexible $Q$.

    • Good approx. to true posterior.
  • Common Restrictions on $Q$

    • (1) Factorization:

      • Assume $Q$ factorises.

      • $Q(Z) = \prod_{j=1}^{m} Q_j(Z_j)$

      • $Z_j$: latent variable.

    • (2) Parameterization:

      • Sigmoid, Softmax.

      • $D_{KL}[q(z;\phi) \| p(z|x)]$
      • $\rightarrow$ $P(Z|x)$ is the target.
      • $q(z;\phi) \rightarrow \phi^*$.

      • Inference = Finding $\phi^*$.
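As a toy illustration of "inference = finding $\phi^*$", the sketch below parameterizes $q(z;\phi)$ as a factorized Gaussian and fits $\phi = (\mu, \log\sigma)$ by gradient descent on the KL to a known Gaussian target standing in for $p(z|x)$; the target's numbers are made up for the example.

```python
import torch
from torch.distributions import Normal, kl_divergence

# A known Gaussian playing the role of the target posterior p(z|x) (illustrative only).
p = Normal(torch.tensor([1.0, -2.0]), torch.tensor([0.5, 1.5]))

# Variational parameters phi = (mu, log_sigma) of a factorized Gaussian q(z; phi).
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(500):
    q = Normal(mu, log_sigma.exp())
    loss = kl_divergence(q, p).sum()  # D_KL[q(z; phi) || p(z|x)], closed form for Gaussians
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.detach(), log_sigma.exp().detach())  # converges toward the target's mean and scale
```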
  • Challenge

    • (1) Not computable.

    • $\phi^* = \arg\min D_{KL}[q(z) \| p(z|x)]$
    • $D_{KL}[q(z) \| p(z|x)] = E_{z \sim q}[\log q(z)] - E_{z \sim q}[\log p(z|x)]$
    • $= E_z[\log q(z)] - (E_z[\log p(z,x)] - \log p(x))$

    • $= E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$

    • $\hookrightarrow$ Intractable (Cannot compute $\log p(x)$).
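A tiny discrete sanity check of the decomposition above (the numbers are made up): the two expectation terms need only $q$ and the joint $p(z,x)$, but recovering the actual KL value still requires $\log p(x)$, which is what makes the objective intractable in general.

```python
import numpy as np

# Toy joint p(z, x) at a fixed observed x, for three latent states z = 0, 1, 2.
p_joint = np.array([0.10, 0.25, 0.05])
p_x = p_joint.sum()                       # in real models this marginal is intractable
p_post = p_joint / p_x                    # true posterior p(z|x)

q = np.array([0.2, 0.5, 0.3])             # some variational q(z)

kl_direct = np.sum(q * (np.log(q) - np.log(p_post)))
decomposed = np.sum(q * np.log(q)) - np.sum(q * np.log(p_joint)) + np.log(p_x)

print(np.isclose(kl_direct, decomposed))  # True: the identity holds
```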

3. Evidence Lower Bound (ELBO)

  • Derivation

    • $\log p(x) \ge ELBO$ (Evidence Lower Bound).

    • Note: $D_{KL}[q(z) \| p(z|x)] = E_z[\log q(z)] - E_z[\log p(z,x)] + \log p(x)$
    • $\log p(x) - D_{KL}[q(z) \| p(z|x)] = E_z[\log p(z,x)] - E_z[\log q(z)]$
    • Left side: Constant ($\log p(x)$) - KL (always $\ge 0$).

    • Right side: ELBO.

    • $\therefore$ Reducing $D_{KL}$ = maximizing the ELBO w.r.t. $q_\phi$.
  • ELBO Formulation

    • $\mathcal{L}(q) = E_z[\log p(z,x)] - E_z[\log q(z)]$

    • Recall that $p(z,x) = p(x|z)p(z)$.
    • $\mathcal{L}(q) = E_z[\log p(x|z) + \log p(z)] - E_z[\log q(z)]$
    • $= E_z[\log p(x|z)] - D_{KL}[q(z) \| p(z)]$
      • Term 1: Expected likelihood ($E_{recon}$).

        • Keep the reconstruction error low.
      • Term 2: KL between prior $p(z)$ and $q(z)$.

        • Regularization (keep $Q$ not too far from $P$).

        • Density $\leftrightarrow$ Prior.

  • Why ELBO?

    • For any $q$, $\log p(x) \ge \mathcal{L}(q)$.

    • Proof: $\mathcal{L}(q) + D_{KL}[q(z) \| p(z|x)] = \log p(x)$.
    • Since $D_{KL} \ge 0$, $\log p(x) \ge \mathcal{L}(q)$.

    • Alternative Proof (Jensen’s Inequality):

      • $\log E[f(x)] \ge E[\log f(x)]$.
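Spelling the Jensen route out in the same notation (multiply and divide by $q(z)$ inside the integral, then use concavity of $\log$):

$$
\log p(x) = \log \int p(x,z)\,dz
          = \log E_{z \sim q}\!\left[\frac{p(x,z)}{q(z)}\right]
          \ge E_{z \sim q}\!\left[\log \frac{p(x,z)}{q(z)}\right]
          = E_z[\log p(x,z)] - E_z[\log q(z)] = \mathcal{L}(q).
$$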

4. Variational Autoencoder (VAE)

  • Concept

    • Basically, likelihood-based autoencoders.

    • $X \xrightarrow[enc]{q(z|x)} Z \xrightarrow[dec]{p(x|z)} X$
    • $\log p(x) \ge E_z[\log p(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$
  • Structure

    • (1) **Encoder ($q_\phi(z|x)$)**: Parameterized.
    • (2) **Decoder ($p_\theta(x|z)$)**: Data generator.
    • (3) Prior ($p(z)$): $z \sim \mathcal{N}(0, I)$.
  • Key Idea

    • Training: Enc $q_\phi(z|x)$ $\rightarrow$ $\mu, \sigma$ $\rightarrow$ sample $z$.
    • Dec $p_\theta(x|z)$ $\rightarrow$ $\tilde{x}$.
    • Difference: the encoder outputs not $z$ itself, but the parameters of the distribution $q(z|x)$ from which we sample $z$.
  • Comparison

    • AE: $x \rightarrow z \rightarrow \tilde{x}$. (Feature learner).

    • VAE: $x \rightarrow q_\phi(z|x) \rightarrow z \rightarrow p_\theta(x|z) \rightarrow \tilde{x}$. (Data generation).
      • $p(x|z)$: the main actor.
  • Formulation & Architecture

    • (1) Autoregressive: $p(x) = \prod_i p(x_i|x_{<i})$.
    • (2) Latents: $p(x) = \int p(x,z)\, dz = \int p(x|z)p(z)\, dz$.
      • Prior $p(z)$ is simple ($\mathcal{N}(0, I)$).

      • $p(x|z)$ is modeled by the Decoder.
      • Output of $p_\theta(x|z)$ depends on the distribution type (Bernoulli $\rightarrow$ Sigmoid, Categorical $\rightarrow$ Softmax).
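A minimal VAE module sketch in PyTorch for the Bernoulli case above; the 784/20/256 dimensions and layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())  # Bernoulli -> sigmoid

    def forward(self, x):
        e = self.enc(x)
        mu, log_var = self.mu(e), self.log_var(e)              # encoder outputs q(z|x) parameters
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterized sample z ~ q(z|x)
        return self.dec(z), mu, log_var                        # decoder mean of p(x|z), plus q params
```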

5. Training VAE

  • Training

    • Deterministic reconstruction: $\hat{x} = E_{p(x|z)}[x]$.
    • Inference: Stochastic reconstruction $\hat{x} \sim p(x|z)$.
  • Loss Function

    • (1) KL Divergence (Regularizer):

      • $D_{KL}(q_\phi(z|x) \| p(z))$.
      • KL between two Gaussians (closed form).

      • $\mathcal{L}_{KL} = \frac{1}{2} \sum_{k=1}^{K} \left(\sigma_k^2(x) + \mu_k^2(x) - 1 - \ln \sigma_k^2(x)\right)$.
    • (2) Reconstruction Loss:

      • $\mathcal{L}_{recon} = -E_{z \sim q}[\log p_\theta(x|z)]$.
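Combining the two terms for the Gaussian-encoder / Bernoulli-decoder case, assuming the outputs of a module like the VAE sketch above (binary cross-entropy plays the role of $-\log p_\theta(x|z)$):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x|z)] for Bernoulli outputs in [0, 1].
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var)
    return recon + kl  # = -ELBO (up to constants), minimized during training
```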
  • Reparameterization Trick

    • Problem: $z \sim q_\phi(z|x)$. How to backprop through sampling?
    • Solution: Move randomness to input layer.

    • $z^{(l)} \sim \mathcal{N}(\mu(x), \Sigma(x))$.

    • Trick: $z^{(l)} = \mu(x) + \sigma(x) \odot \epsilon^{(l)}$ where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$.

    • Now differentiable w.r.t. $\mu$ and $\sigma$.
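A minimal gradient check of the trick (toy tensors only): after reparameterization, the sample $z$ is a deterministic function of $(\mu, \sigma, \epsilon)$, so gradients reach $\mu$ and $\sigma$.

```python
import torch

mu = torch.tensor([0.0, 0.0], requires_grad=True)
sigma = torch.tensor([1.0, 1.0], requires_grad=True)

eps = torch.randn(2)        # randomness moved to an input: eps ~ N(0, I)
z = mu + sigma * eps        # z ~ N(mu, sigma^2), now differentiable w.r.t. mu and sigma

z.sum().backward()
print(mu.grad, sigma.grad)  # dz/dmu = 1, dz/dsigma = eps
```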