12. GAN
1. Introduction
Deep generative models
$\hookrightarrow$ Likelihood-based
Autoregressive models: Tractable density. $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$.
VAE: Intractable density, latent space $Z$. $p(x) = \int p(z)\, p(x \mid z)\, dz$. Optimize the ELBO.
Flow-based models.
$\hookrightarrow$ Likelihood-free
GAN.
Diffusion (score-based).
Problem & Solution
$\Rightarrow$ Don't estimate $p(x)$; just enable sampling?
$\hookrightarrow$ Want to sample!
(1) Sample from a simple dist. (e.g., Gaussian).
(2) Learn complex transformation. (Simple $\rightarrow$ Complex).
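A minimal sketch of this two-step idea in PyTorch (the MLP, `latent_dim`, and all sizes are illustrative stand-ins for the learned transformation):

```python
# Minimal sketch: sample from a simple distribution, then map the sample
# through a learned transformation into a complex data distribution.
# (Illustrative; network sizes are arbitrary.)
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # e.g., flattened 28x28 images

# The "complex transformation" is a neural network G: z -> x.
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

z = torch.randn(16, latent_dim)  # (1) sample from a simple Gaussian
x_fake = G(z)                    # (2) transform into data space
print(x_fake.shape)              # torch.Size([16, 784])
```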
2. GAN (Generative Adversarial Networks)
Concept
Game-theoretic approach.
Discriminator ($D$): Distinguish Real vs. Fake.
Real $\rightarrow$ 1.
Fake ($G(z)$) $\rightarrow$ 0.
Generator ($G$): Fool the discriminator.
$z \sim \mathcal{N}$ (Noise).
Generate fake data $G(z)$.
Wants $D(G(z)) \rightarrow 1$.
Diagram Flow
$z \text{ (Noise)} \xrightarrow{G} G(z) \text{ (Fake)} \xrightarrow{D} [0, 1]$
Real Data $x \xrightarrow{D} [0, 1]$
Objectives
Discriminator:
$D(x) \rightarrow 1$.
$D(G(z)) \rightarrow 0$.
Generator:
$D(G(z)) \rightarrow 1$.
Comparison: GAN vs Diffusion
GAN: High-quality samples, fast sampling.
- Cons: Training instability, Mode Collapse.
Diffusion: Diverse samples, stable training.
- Cons: Slow inference (Long sampling time).
3. Formulation of Training Objectives
Minimax Game
- $\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
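A minimal sketch of one alternating update of $V(D, G)$ in PyTorch, written with the explicit log terms; the toy MLPs, random data, and hyperparameters are illustrative assumptions, not a reference implementation:

```python
# One alternating update of the minimax objective V(D, G).
# Toy networks and random data; sizes and learning rates are illustrative only.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 64, 2, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-8

x_real = torch.randn(batch, data_dim)   # stand-in for a real data batch
z = torch.randn(batch, latent_dim)      # z ~ p_z (Gaussian noise)

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))  (minimize the negative)
x_fake = G(z).detach()                  # do not backprop into G here
loss_d = -(torch.log(D(x_real) + eps).mean() + torch.log(1 - D(x_fake) + eps).mean())
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: minimize log(1 - D(G(z)))  (original minimax form)
loss_g = torch.log(1 - D(G(z)) + eps).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```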
Optimal Discriminator
For fixed $G$, the optimal discriminator $D^*$ is:
$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
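Derivation (pointwise maximization of the value integrand):
$$V(D, G) = \int_x \left[ p_{data}(x)\log D(x) + p_g(x)\log(1 - D(x)) \right] dx$$
For each $x$, the integrand has the form $a \log d + b \log(1-d)$ with $a = p_{data}(x)$, $b = p_g(x)$, which is maximized over $d \in (0, 1)$ at $d = \frac{a}{a+b}$, giving $D^*(x)$ above.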
Nash Equilibrium:
Occurs when $p_g(x) = p_{data}(x)$.
$D^*(x) = \frac{1}{2}$.
Value of game becomes $2 \log \frac{1}{2} = -\log 4$.
Training $G$ against $D^*$ amounts to minimizing the Jensen-Shannon Divergence (JSD) between $p_{data}$ and $p_g$.
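Plugging $D^*$ back into $V$ makes the JSD connection explicit:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right] = -\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)$$
Since $\mathrm{JSD} \geq 0$ with equality iff $p_g = p_{data}$, minimizing over $G$ recovers the Nash equilibrium above.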
Gradient Issues (Vanishing Gradient)
Update Rule:
$\theta_g \leftarrow \theta_g - \eta \frac{\partial J}{\partial \theta_g}$.
$\frac{\partial J}{\partial \theta_g} = \frac{\partial J}{\partial D(G(z))} \cdot \frac{\partial D(G(z))}{\partial G(z)} \cdot \frac{\partial G(z)}{\partial \theta_g}$.
Problem:
If Discriminator is too good (Perfect $D$), then $D(G(z)) \approx 0$ (flat region of sigmoid).
$\frac{\partial D}{\partial G} \approx 0$.
G learns nothing (Gradient vanishes).
Solution (Heuristic / Non-saturating Loss):
Instead of minimizing $\log(1 - D(G(z)))$,
Maximize $\log D(G(z))$.
Provides stronger gradients early in training.
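A small sketch contrasting the two generator losses (illustrative PyTorch; `d_fake` stands for the sigmoid outputs $D(G(z))$):

```python
# Generator loss: saturating (original minimax) vs. non-saturating heuristic.
# d_fake is assumed to be sigmoid output in (0, 1); eps avoids log(0).
import torch

def g_loss_saturating(d_fake: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # minimize log(1 - D(G(z))): gradient vanishes when D(G(z)) ~ 0
    return torch.log(1 - d_fake + eps).mean()

def g_loss_non_saturating(d_fake: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # maximize log D(G(z)) <=> minimize -log D(G(z)): strong gradient when D(G(z)) ~ 0
    return -torch.log(d_fake + eps).mean()

d_fake = torch.tensor([0.01, 0.02, 0.05])  # a confident discriminator early in training
print(g_loss_saturating(d_fake), g_loss_non_saturating(d_fake))
```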
4. Advanced GANs
SAGAN (Self-Attention GAN)
Motivation:
Complex problem $\rightarrow$ Complex model.
Convolution only captures local dependencies.
Features:
Self-Attention: Captures global dependencies.
Spectral Normalization: Stabilizes discriminator training (Lipschitz constraint).
Conditional Generation: $G(z \mid y)$, $D(x, y)$.
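A minimal self-attention block over conv feature maps, in the spirit of SAGAN. The channel reduction to $C/8$ and the learnable $\gamma$ follow the usual recipe; this is a sketch under those assumptions, not the reference implementation:

```python
# Self-attention over a (B, C, H, W) feature map: every position attends to
# every other position, capturing global dependencies that convolution misses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key   = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(q @ k, dim=-1)                # (B, HW, HW): global dependencies
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection

x = torch.randn(2, 64, 16, 16)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 16, 16])
```

In SAGAN, the conv layers here (and in the discriminator) would additionally be wrapped with spectral normalization (e.g. `torch.nn.utils.spectral_norm`) to enforce the Lipschitz constraint.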
5. Performance Evaluation
Quality - Diversity Trade-off
(1) Quality
**Conditional distribution $p(y \mid x)$**. If image $x$ is clear (High Quality) $\rightarrow$ Classifier predicts class $y$ confidently.
Low Entropy of $p(y \mid x)$ (Sharp distribution). Quality is inversely proportional to this entropy.
- Bad quality $x$ $\rightarrow$ High entropy.
(2) Diversity
**Marginal distribution $p(y) = \int p(y \mid x = G(z))\, p(z)\, dz$**. If $G$ generates diverse classes:
High Entropy of $p(y)$ (Uniform distribution over classes).
Diverse $x = G(z)$ $\rightarrow$ High entropy.
Evaluation Metrics
(1) Inception Score (IS)
$IS(G) = \exp\left(\mathbb{E}_{x \sim G} \left[ D_{KL}\big( p(y \mid x) \,\|\, p(y) \big) \right]\right)$. Uses a pre-trained Inception Network.
Goal:
$p(y \mid x)$ should be sharp (Low entropy) $\rightarrow$ High Quality. $p(y)$ should be flat (High entropy) $\rightarrow$ High Diversity.
- KL Divergence between them should be large.
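A small sketch of the IS computation from a matrix of class probabilities $p(y \mid x)$, one row per generated image; the Dirichlet samples below are a random stand-in for Inception softmax outputs:

```python
# Inception Score from per-image class probabilities p(y|x).
# Illustrative sketch; real rows come from a pre-trained Inception network.
import numpy as np

def inception_score(p_yx: np.ndarray, eps: float = 1e-12) -> float:
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                               # exp(E_x[KL(p(y|x) || p(y))])

rng = np.random.default_rng(0)
p_yx = rng.dirichlet(alpha=np.ones(10) * 0.1, size=5000)          # sharp per-image predictions
print(inception_score(p_yx))                                      # high when sharp and diverse
```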
Limitation:
If $G$ generates only one image per class:
Classes are diverse $\rightarrow$ High entropy of $p(y)$.
Each generated image always yields the same prediction $\rightarrow$ conditional $p(y \mid x)$ is sharp.
- Misrepresentation: IS is high, but actual (within-class) diversity is low. (Mode collapse is not fully detected).
(2) Fréchet Inception Distance (FID)
$\hookrightarrow$ Measures distance between feature distributions of Real ($x_r$) and Generated ($x_g$) data.
$FID(x_r, x_g) = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$. Assumes features follow a Gaussian distribution.
Lower is better (Distance 0 means identical distributions).
- More robust than IS.
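A small sketch of the FID computation from two feature matrices; random arrays stand in for Inception features, and `scipy.linalg.sqrtm` gives the matrix square root:

```python
# FID between real and generated feature sets, each modeled as a Gaussian.
# Illustrative sketch; real features come from a pre-trained Inception network.
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))

rng = np.random.default_rng(0)
feat_real = rng.normal(size=(1000, 64))
feat_gen = rng.normal(loc=0.5, size=(1000, 64))
print(fid(feat_real, feat_gen))           # lower is better; 0 for identical distributions
```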