12. GAN
1. Introduction
Deep generative models
$\hookrightarrow$ Likelihood-based
Autoregressive models: Tractable density. $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$.
VAE: Intractable density, latent space $Z$. $p(x) = \int p(z)\, p(x \mid z)\, dz$. Optimize the ELBO.
Flow-based models.
$\hookrightarrow$ Likelihood-free
GAN.
Diffusion (score-based).
Problem & Solution
$\Rightarrow$ Don't estimate $p(x)$; just enable sampling?
$\hookrightarrow$ Want to sample!
(1) Sample from a simple dist. (e.g., Gaussian).
(2) Learn complex transformation. (Simple $\rightarrow$ Complex).
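A minimal sketch of this two-step idea in PyTorch (the MLP, `latent_dim`, and all sizes are illustrative stand-ins for the learned transformation):

```python
# Minimal sketch: sample from a simple distribution, then map the sample
# through a learned transformation into a complex data distribution.
# (Illustrative; network sizes are arbitrary.)
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # e.g., flattened 28x28 images

# The "complex transformation" is a neural network G: z -> x.
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

z = torch.randn(16, latent_dim)  # (1) sample from a simple Gaussian
x_fake = G(z)                    # (2) transform into data space
print(x_fake.shape)              # torch.Size([16, 784])
```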
2. GAN (Generative Adversarial Networks)
Concept
Game-theoretic approach.
Discriminator ($D$): Distinguish Real vs. Fake.
Real $\rightarrow$ 1.
Fake ($G(z)$) $\rightarrow$ 0.
Generator ($G$): Fool the discriminator.
$z \sim \mathcal{N}$ (Noise).
Generate fake data $G(z)$.
Wants $D(G(z)) \rightarrow 1$.
Diagram Flow
$z \text{ (Noise)} \xrightarrow{G} G(z) \text{ (Fake)} \xrightarrow{D} [0, 1]$
Real Data $x \xrightarrow{D} [0, 1]$
Objectives
Discriminator:
$D(x) \rightarrow 1$.
$D(G(z)) \rightarrow 0$.
Generator:
$D(G(z)) \rightarrow 1$.
Comparison: GAN vs Diffusion
GAN: High-quality samples, fast sampling.
- Cons: Training instability, Mode Collapse.
Diffusion: Diverse samples, stable training.
- Cons: Slow inference (Long sampling time).
3. Formulation of Training Objectives
Minimax Game
- $\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
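A minimal sketch of one alternating update of $V(D, G)$ in PyTorch, written with the explicit log terms; the toy MLPs, random data, and hyperparameters are illustrative assumptions, not a reference implementation:

```python
# One alternating update of the minimax objective V(D, G).
# Toy networks and random data; sizes and learning rates are illustrative only.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 64, 2, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-8

x_real = torch.randn(batch, data_dim)   # stand-in for a real data batch
z = torch.randn(batch, latent_dim)      # z ~ p_z (Gaussian noise)

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))  (minimize the negative)
x_fake = G(z).detach()                  # do not backprop into G here
loss_d = -(torch.log(D(x_real) + eps).mean() + torch.log(1 - D(x_fake) + eps).mean())
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: minimize log(1 - D(G(z)))  (original minimax form)
loss_g = torch.log(1 - D(G(z)) + eps).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```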
Optimal Discriminator
For fixed $G$, the optimal discriminator $D^*$ is:
$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
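Derivation (pointwise maximization of the value integrand):
$$V(D, G) = \int_x \left[ p_{data}(x)\log D(x) + p_g(x)\log(1 - D(x)) \right] dx$$
For each $x$, the integrand has the form $a \log d + b \log(1-d)$ with $a = p_{data}(x)$, $b = p_g(x)$, which is maximized over $d \in (0, 1)$ at $d = \frac{a}{a+b}$, giving $D^*(x)$ above.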
Nash Equilibrium:
Occurs when $p_g(x) = p_{data}(x)$.
$D^*(x) = \frac{1}{2}$.
Value of game becomes $2 \log \frac{1}{2} = -\log 4$.
Training $G$ against $D^*$ amounts to minimizing the Jensen-Shannon Divergence (JSD) between $p_{data}$ and $p_g$.
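Plugging $D^*$ back into $V$ makes the JSD connection explicit:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right] = -\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)$$
Since $\mathrm{JSD} \geq 0$ with equality iff $p_g = p_{data}$, minimizing over $G$ recovers the Nash equilibrium above.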
Gradient Issues (Vanishing Gradient)
Update Rule:
$\theta_g \leftarrow \theta_g - \eta \frac{\partial J}{\partial \theta_g}$.
$\frac{\partial J}{\partial \theta_g} = \frac{\partial J}{\partial D(G(z))} \cdot \frac{\partial D(G(z))}{\partial G(z)} \cdot \frac{\partial G(z)}{\partial \theta_g}$.
Problem:
If Discriminator is too good (Perfect $D$), then $D(G(z)) \approx 0$ (flat region of sigmoid).
$\frac{\partial D}{\partial G} \approx 0$.
G learns nothing (Gradient vanishes).
Solution (Heuristic / Non-saturating Loss):
Instead of minimizing $\log(1 - D(G(z)))$,
Maximize $\log D(G(z))$.
Provides stronger gradients early in training.
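A small sketch contrasting the two generator losses (illustrative PyTorch; `d_fake` stands for the sigmoid outputs $D(G(z))$):

```python
# Generator loss: saturating (original minimax) vs. non-saturating heuristic.
# d_fake is assumed to be sigmoid output in (0, 1); eps avoids log(0).
import torch

def g_loss_saturating(d_fake: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # minimize log(1 - D(G(z))): gradient vanishes when D(G(z)) ~ 0
    return torch.log(1 - d_fake + eps).mean()

def g_loss_non_saturating(d_fake: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # maximize log D(G(z)) <=> minimize -log D(G(z)): strong gradient when D(G(z)) ~ 0
    return -torch.log(d_fake + eps).mean()

d_fake = torch.tensor([0.01, 0.02, 0.05])  # a confident discriminator early in training
print(g_loss_saturating(d_fake), g_loss_non_saturating(d_fake))
```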
4. Advanced GANs
SAGAN (Self-Attention GAN)
Motivation:
Complex problem $\rightarrow$ Complex model.
Convolution only captures local dependencies.
Features:
Self-Attention: Captures global dependencies.
Spectral Normalization: Stabilizes discriminator training (Lipschitz constraint).
Conditional Generation: $G(z \mid y)$, $D(x, y)$.
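A minimal self-attention block over conv feature maps, in the spirit of SAGAN. The channel reduction to $C/8$ and the learnable $\gamma$ follow the usual recipe; this is a sketch under those assumptions, not the reference implementation:

```python
# Self-attention over a (B, C, H, W) feature map: every position attends to
# every other position, capturing global dependencies that convolution misses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key   = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(q @ k, dim=-1)                # (B, HW, HW): global dependencies
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection

x = torch.randn(2, 64, 16, 16)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 16, 16])
```

In SAGAN, the conv layers here (and in the discriminator) would additionally be wrapped with spectral normalization (e.g. `torch.nn.utils.spectral_norm`) to enforce the Lipschitz constraint.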
5. Performance Evaluation
Quality - Diversity Trade-off
(1) Quality
**Conditional distribution $p(y \mid x)$**. If image $x$ is clear (High Quality) $\rightarrow$ Classifier predicts class $y$ confidently.
Low Entropy of $p(y \mid x)$ (Sharp distribution). Quality is inversely proportional to this entropy.
- Bad quality $x$ $\rightarrow$ High entropy.
(2) Diversity
**Marginal distribution $p(y) = \int p(y \mid x = G(z))\, p(z)\, dz$**. If $G$ generates diverse classes:
High Entropy of $p(y)$ (Uniform distribution over classes).
Diverse $x = G(z)$ $\rightarrow$ High entropy.
Evaluation Metrics
(1) Inception Score (IS)
$IS(G) = \exp\left(\mathbb{E}_{x \sim G} \left[ D_{KL}\big( p(y \mid x) \,\|\, p(y) \big) \right]\right)$. Uses a pre-trained Inception Network.
Goal:
$p(y \mid x)$ should be sharp (Low entropy) $\rightarrow$ High Quality. $p(y)$ should be flat (High entropy) $\rightarrow$ High Diversity.
- KL Divergence between them should be large.
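A small sketch of the IS computation from a matrix of class probabilities $p(y \mid x)$, one row per generated image; the Dirichlet samples below are a random stand-in for Inception softmax outputs:

```python
# Inception Score from per-image class probabilities p(y|x).
# Illustrative sketch; real rows come from a pre-trained Inception network.
import numpy as np

def inception_score(p_yx: np.ndarray, eps: float = 1e-12) -> float:
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                               # exp(E_x[KL(p(y|x) || p(y))])

rng = np.random.default_rng(0)
p_yx = rng.dirichlet(alpha=np.ones(10) * 0.1, size=5000)          # sharp per-image predictions
print(inception_score(p_yx))                                      # high when sharp and diverse
```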
Limitation:
If $G$ generates only one image per class:
Classes are diverse $\rightarrow$ High entropy of $p(y)$.
Each generated image always yields the same prediction $\rightarrow$ conditional $p(y \mid x)$ is sharp.
- Misrepresentation: IS is high, but actual (within-class) diversity is low. (Mode collapse is not fully detected).
(2) Fréchet Inception Distance (FID)
$\hookrightarrow$ Measures distance between feature distributions of Real ($x_r$) and Generated ($x_g$) data.
$FID(x_r, x_g) = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$. Assumes features follow a Gaussian distribution.
Lower is better (Distance 0 means identical distributions).
- More robust than IS.
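A small sketch of the FID computation from two feature matrices; random arrays stand in for Inception features, and `scipy.linalg.sqrtm` gives the matrix square root:

```python
# FID between real and generated feature sets, each modeled as a Gaussian.
# Illustrative sketch; real features come from a pre-trained Inception network.
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))

rng = np.random.default_rng(0)
feat_real = rng.normal(size=(1000, 64))
feat_gen = rng.normal(loc=0.5, size=(1000, 64))
print(fid(feat_real, feat_gen))           # lower is better; 0 for identical distributions
```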