3. Optimization
1. Gradient-based Optimization
(1) Slope-based
- Collect gradients (vector) $\rightarrow$ Gradient Descent
(2) Curvature-based
- Hessian (matrix) $\rightarrow$ Newton’s Method
Method of Gradient Descent
$\hookrightarrow$ First-order, iterative.
$f(\theta)$: Target to minimize.
Newton’s Method
$\hookrightarrow$ Zero-finding
$\theta \leftarrow \theta - \epsilon \cdot \frac{\partial f}{\partial \theta}$ (Gradient Descent)
$x_{t+1} = x_{t} - \frac{f(x_{t})}{f'(x_{t})}$ (Newton's root finding)
$\hookrightarrow$ Minimization
$x_{t+1} = x_{t} - \frac{f'(x_{t})}{f''(x_{t})}$
Note: $\frac{1}{f''(x_t)}$ acts as the learning rate.
Guarantees only local minima.
- (Initial location is critical)
Issues:
A second derivative is needed for every pair of parameters ($\binom{n}{2}$ mixed partials plus the $n$ diagonal ones), which is unbearable for large $n$.
Intuition: take a far jump, guided by the curvature.
$f: \mathbb{R}^n \rightarrow \mathbb{R}$, $\theta \leftarrow \theta - H^{-1}\nabla f$
$H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \in \mathbb{R}^{n \times n}$
O($n^2$) space!
An $n \times n$ matrix is needed, like a covariance matrix: $C_{ij} = \mathrm{Cov}(X_i, X_j)$, $C \in \mathbb{R}^{n \times n}$.
Computing $H^{-1}$ requires $O(n^3)$ operations.
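A minimal sketch contrasting the two update rules in 1-D; the toy function $f(x) = x^4 - 3x^2 + 2$, the starting point, and the step size are illustrative assumptions, not from the lecture.

```python
# Gradient descent vs. Newton's method on f(x) = x^4 - 3x^2 + 2 (toy example).
# Newton divides the slope by the curvature, so 1/f''(x) plays the role of an
# automatically adapted learning rate; gradient descent uses a fixed epsilon.

def f_prime(x):            # f'(x)
    return 4 * x**3 - 6 * x

def f_double_prime(x):     # f''(x)
    return 12 * x**2 - 6

x_gd, x_newton, epsilon = 2.0, 2.0, 0.01
for t in range(50):
    x_gd     -= epsilon * f_prime(x_gd)                         # x <- x - eps * f'(x)
    x_newton -= f_prime(x_newton) / f_double_prime(x_newton)    # x <- x - f'(x) / f''(x)

print(x_gd, x_newton)      # both approach the local minimum at x = sqrt(3/2)
```

Newton gets close in a handful of steps, but in $n$ dimensions that division becomes the $O(n^3)$ multiplication by $H^{-1}$ discussed above.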
2. Gradient Noise
[Diagram: trajectory of $W$: smooth curve (full-batch GD) vs. wiggly curve (SGD).]
Stochastic: computed on a mini-batch of $m$ examples.
$\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L(y^{(i)}, \hat{y}^{(i)})$
Optimization in DL
Training is the hardest part.
Stochastic Gradient Descent (SGD)
Gradient Descent Steps
$\hat{g}$: an estimate of the gradient (the average gradient over the mini-batch).
$\theta \leftarrow \theta - \epsilon \hat{g}$
But it suffers from…
Local minima: zero gradient.
Gradient noise: the stochastic estimate makes the SGD trajectory noisy.
Poor conditioning of the Hessian: half-pipe-shaped loss surface.
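A minimal sketch of the SGD step above on a toy least-squares problem; the dataset, loss, batch size, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy dataset (N = 1000)
theta, epsilon, m = np.zeros(5), 0.1, 32                   # parameters, learning rate, batch size

for step in range(200):
    idx = rng.choice(len(X), size=m, replace=False)        # sample a mini-batch of m examples
    X_b, y_b = X[idx], y[idx]
    residual = X_b @ theta - y_b
    g_hat = (2.0 / m) * X_b.T @ residual                   # g_hat: average gradient over the batch
    theta -= epsilon * g_hat                               # theta <- theta - eps * g_hat
```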
Reducing Noise
(1) Sample-wise
$m=1$ (SGD) $\rightarrow$ $1 < m < N$ (Mini-batch) $\rightarrow$ $m=N$ (Batch GD - expensive).
Increasing batch size reduces noise ($1 \rightarrow 200$).
(2) Time-wise
Moving average.
EWMA (Exponentially Weighted Moving Average).
EWMA
$v_t = \begin{cases} g_1 & t=1 \\ \alpha v_{t-1} + (1-\alpha)g_t & t > 1 \end{cases}$
$\rightarrow$ $v_t$: EWMA at time $t$.
$\alpha \in [0, 1]$: Constant smoothing factor.
$\alpha = 0.9$: Smoother curve, adapts slowly.
$\alpha = 0.5$: Adapts quickly.
Expansion:
$v_t = (1-\alpha)[g_t + \alpha g_{t-1} + \alpha^2 g_{t-2} + \dots]$
(Normalized sum: $\frac{g_t + \alpha g_{t-1} + \dots}{1 + \alpha + \alpha^2 + \dots}$, since $1 + \alpha + \alpha^2 + \cdots = \frac{1}{1-\alpha}$.)
Interpretation:
$v_t \approx$ average over $\frac{1}{1-\alpha}$ last time points.
The smoothed curve lags behind (is shifted relative to) the raw signal, more so for larger $\alpha$.
Types of Moving Average
SMA (Simple Moving Average): $v_t = \frac{\hat{g}_t + \hat{g}_{t-1} + \hat{g}_{t-2}}{3}$
WMA (Weighted Moving Average): $v_t = \frac{w_1 \hat{g}_t + \dots + w_3 \hat{g}_{t-2}}{\sum w_i}$
Convolution with learned weights.
EWMA: $v_t = \alpha v_{t-1} + (1-\alpha) \hat{g}_t$ (recursive, matching the definition above).
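A sketch of the three averages applied to a noisy toy gradient signal; the signal, the window size of 3, and $\alpha = 0.9$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.sin(np.linspace(0, 6, 200)) + 0.3 * rng.normal(size=200)   # noisy toy "gradients"

sma = np.convolve(g, np.ones(3) / 3, mode="valid")       # window-3 simple moving average

w = np.array([3.0, 2.0, 1.0])                            # w_1 (newest sample) weighted most
wma = np.convolve(g, w / w.sum(), mode="valid")          # window-3 weighted moving average

alpha, v = 0.9, g[0]                                     # v_1 = g_1
ewma = [v]
for g_t in g[1:]:
    v = alpha * v + (1 - alpha) * g_t                    # v_t = alpha*v_{t-1} + (1-alpha)*g_t
    ewma.append(v)                                       # ~ average over the last 1/(1-alpha) = 10 points
```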
Gradient Descent with Momentum
(Like inertia.)
$\hookrightarrow$ Compute EWMA of gradients and use it to update.
Result: Faster, less noise.
Forms
Let $g := \nabla_{\theta} J(\theta)$
(1) Standard Momentum
$v \leftarrow \alpha v - \epsilon g$
$\theta \leftarrow \theta + v$, i.e. $\theta \leftarrow \theta + \alpha v_{\text{old}} - \epsilon g$. (Note: notation varies across texts.)
(2) Alternative Form
$v \leftarrow \alpha v + (1-\alpha)g$
$\theta \leftarrow \theta - \epsilon v$
(3) Common Form
$v \leftarrow \alpha v + g$
$\theta \leftarrow \theta - \epsilon v \Rightarrow \theta \leftarrow \theta - \epsilon (\alpha v + g)$
$\hookrightarrow \theta \leftarrow \theta - \epsilon (g + \text{constant} \cdot v)$
Note: Nesterov Accelerated Gradient (NAG)
“Lookahead”
Calculate gradient at $\theta + \alpha v$ instead of $\theta$.
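A toy sketch of the common-form momentum update above and the Nesterov lookahead, on an ill-conditioned quadratic bowl; the loss, $\alpha$, and $\epsilon$ are illustrative assumptions.

```python
import numpy as np

def grad(theta):                               # gradient of the toy loss 0.5*(x^2 + 25*y^2)
    return np.array([1.0, 25.0]) * theta

alpha, epsilon = 0.9, 0.01

# (3) Common-form momentum: v <- alpha*v + g,  theta <- theta - eps*v
theta, v = np.array([2.0, 2.0]), np.zeros(2)
for t in range(300):
    g = grad(theta)
    v = alpha * v + g                          # EWMA-like accumulation of gradients
    theta = theta - epsilon * v

# NAG (standard form): gradient is evaluated at the lookahead point theta + alpha*v
theta, v = np.array([2.0, 2.0]), np.zeros(2)
for t in range(300):
    g = grad(theta + alpha * v)                # "lookahead" gradient
    v = alpha * v - epsilon * g                # v <- alpha*v - eps*g
    theta = theta + v                          # theta <- theta + v
```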
Per-Parameter Adaptive Learning Rate
$\theta \leftarrow \theta - \epsilon_t \cdot g$
Adjust it with an EWMA.
The step size should also differ per parameter.
Motivation
- Vertical oscillations in steep directions.
AdaGrad
$\hookrightarrow$ Adapts learning rate.
Steep direction ($\frac{\partial J}{\partial \theta_i}$ large) $\rightarrow$ Slow down.
Gently sloped ($\frac{\partial J}{\partial \theta_j}$ small) $\rightarrow$ Speed up.
Formula:
$G_t = G_{t-1} + g_t^2$
$\theta \leftarrow \theta - \frac{\epsilon}{\sqrt{G_t + \delta}} \cdot g_t$
Issue: $G_t$ increases monotonically $\rightarrow$ Learning rate decreases to 0 (Stops learning).
(Can we address this without computing $H$?)
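A minimal AdaGrad sketch on a toy ill-conditioned gradient; the loss and the constants $\epsilon, \delta$ are illustrative.

```python
import numpy as np

def grad(theta):                                    # steep along axis 1, gentle along axis 0
    return np.array([1.0, 25.0]) * theta

theta = np.array([2.0, 2.0])
G, epsilon, delta = np.zeros(2), 0.5, 1e-8

for t in range(500):
    g = grad(theta)
    G += g * g                                      # G_t = G_{t-1} + g_t^2 (element-wise, monotone)
    theta -= epsilon / np.sqrt(G + delta) * g       # large G_i -> small step for parameter i
# G only grows, so eps/sqrt(G) shrinks toward 0 and learning eventually stalls.
```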
RMSProp
AdaGrad for non-convex setting.
Accumulate gradients with an EWMA (use only recent values).
Exponentially decaying average.
Discard extreme past.
Note: Element-wise product ($g \odot g$).
Algorithm
$r \leftarrow \rho r + (1-\rho) g \odot g$
$\theta \leftarrow \theta - \frac{\epsilon}{\sqrt{r+\delta}} \odot g$
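A minimal RMSProp sketch; compared with the AdaGrad sketch above, only the accumulator changes (the toy loss and constants are illustrative).

```python
import numpy as np

def grad(theta):
    return np.array([1.0, 25.0]) * theta            # toy ill-conditioned gradient

theta = np.array([2.0, 2.0])
r, rho, epsilon, delta = np.zeros(2), 0.9, 0.01, 1e-8

for t in range(500):
    g = grad(theta)
    r = rho * r + (1 - rho) * g * g                 # r <- rho*r + (1-rho) * (g ⊙ g)
    theta -= epsilon / np.sqrt(r + delta) * g       # per-parameter scaled step; old gradients decay away
```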
Adam
$\hookrightarrow$ RMSProp + Momentum + Bias Correction.
First iterations: inaccurate (biased toward 0, since $s$ and $r$ are initialized at 0).
Steps
Calculate Gradient $g$.
Momentum: $s \leftarrow \rho_1 s + (1-\rho_1)g$
RMSProp: $r \leftarrow \rho_2 r + (1-\rho_2) g \odot g$
Bias Correction:
$\hat{s} = \frac{s}{1-\rho_1^t}$
$\hat{r} = \frac{r}{1-\rho_2^t}$
(Slightly larger than the raw value.)
Update: $\theta \leftarrow \theta - \epsilon \frac{\hat{s}}{\sqrt{\hat{r} + \delta}}$
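A sketch of the Adam steps above on the same kind of toy gradient; the hyperparameters $\rho_1, \rho_2, \epsilon, \delta$ are commonly used illustrative values, not prescribed by the lecture.

```python
import numpy as np

def grad(theta):
    return np.array([1.0, 25.0]) * theta              # toy ill-conditioned gradient

theta = np.array([2.0, 2.0])
s, r = np.zeros(2), np.zeros(2)                       # momentum and RMSProp accumulators
rho1, rho2, epsilon, delta = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 1001):                              # t starts at 1 for the bias correction
    g = grad(theta)
    s = rho1 * s + (1 - rho1) * g                     # momentum EWMA
    r = rho2 * r + (1 - rho2) * g * g                 # RMSProp EWMA
    s_hat = s / (1 - rho1**t)                         # bias correction: large early, ~1 later
    r_hat = r / (1 - rho2**t)
    theta -= epsilon * s_hat / np.sqrt(r_hat + delta) # update
```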
Learning Rate Scheduling
Gradient based: SGD, SGD+Momentum, AdaGrad, RMSProp, Adam.
$\Rightarrow$ The learning rate needs to gradually decrease over time!
Decay: Step decay, Exponential decay.
- Loss vs Epoch graph showing drops (decay).
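Minimal sketches of the two decay schedules named above; the initial rate, drop factor, and decay constant are illustrative.

```python
import math

def step_decay(epoch, lr0=0.1, drop=0.5, every=30):
    """Step decay: halve the learning rate every 30 epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(epoch, lr0=0.1, k=0.05):
    """Exponential decay: lr = lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

print([step_decay(e) for e in (0, 30, 60, 90)])      # 0.1, 0.05, 0.025, 0.0125
```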
Learning Rate Warm-up
Linear scaling rule: when the mini-batch size $m$ is multiplied by $k$, the learning rate should also be multiplied by $k$ to converge faster (the larger batch reduces the gradient variance).
Warm up: Low learning rate at start of training.
$\hookrightarrow$ Overcomes early optimization difficulty.
$\hookrightarrow$ Larger steps are fine afterwards.
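A sketch combining the linear scaling rule with a linear warm-up; `base_lr`, `k`, and `warmup_epochs` are hypothetical values.

```python
def warmup_lr(epoch, base_lr=0.1, k=4, warmup_epochs=5):
    """Linear warm-up toward the k-scaled base learning rate."""
    target = base_lr * k                                # linear scaling: batch size x k -> lr x k
    if epoch < warmup_epochs:
        return target * (epoch + 1) / warmup_epochs     # low learning rate at the start
    return target                                       # larger steps are fine afterwards

print([warmup_lr(e) for e in range(7)])                 # ~0.08, 0.16, 0.24, 0.32, 0.4, 0.4, 0.4
```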
Problem of Adaptive Learning Rate
$\hookrightarrow$ Its variance.
Variance is large in early stage $\rightarrow$ Bad local minima.
RAdam (Rectified Adam): Rectifies variance of adaptive learning rate.
Loss Landscape and Generalization
Sharp minima: Fast convergence, but poor generalization.
Flat minima: Slow convergence, better generalization.
SAM (Sharpness-Aware Minimization)
$\hookrightarrow$ Minimizes loss value & loss sharpness.
Def: $\text{Sharpness} = \max_{\|\epsilon\|_2 \le \rho} \big(L(\theta+\epsilon) - L(\theta)\big)$. Maximize the loss in a neighborhood, then minimize that maximum.
- Better generalization, but additional computation.
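A sketch of the two-step SAM update using its usual first-order approximation of the inner maximization; the toy loss, $\rho$, and $\epsilon$ are illustrative, and a real implementation would perturb and then restore all model parameters.

```python
import numpy as np

def grad_L(theta):                                        # gradient of a toy loss
    return np.array([1.0, 25.0]) * theta

theta, rho, epsilon = np.array([2.0, 2.0]), 0.05, 0.01
for t in range(300):
    g = grad_L(theta)
    eps_hat = rho * g / (np.linalg.norm(g) + 1e-12)       # approx. worst-case perturbation in the rho-ball
    g_sam = grad_L(theta + eps_hat)                       # gradient at the perturbed (sharpest) point
    theta = theta - epsilon * g_sam                       # descend on the sharpness-aware objective
```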
Large-batch Training
Large batch $\rightarrow$ $\hat{g}_{t+1}$ accurate $\rightarrow$ Faster convergence.
- Pros: Decrease training time by parallelization (GPU).
But weaker regularization (less noisy batch).
Cons: Generalization gap.
$\hookrightarrow$ Converge to sharp minima (Smaller batch is noisy, so escape $\rightarrow$ better generalization).
Solution: Ghost Batch
Split a mini-batch into several small “ghost” batches.
Diagram:
[Input Batch] $\rightarrow$ [Split] $\rightarrow$ [Ghost Batch 1, 2, …] $\rightarrow$ [Net]
Ghost batches introduce slight noise.
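A sketch of splitting one large mini-batch into ghost batches and computing statistics per ghost batch (e.g. for per-ghost-batch normalization); shapes and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
big_batch = rng.normal(size=(1024, 32))              # [Input Batch]: 1024 examples, 32 features
ghost_batches = np.split(big_batch, 1024 // 128)     # [Split] -> 8 ghost batches of 128

for gb in ghost_batches:                             # [Ghost Batch 1, 2, ...] -> [Net]
    mean, std = gb.mean(axis=0), gb.std(axis=0)      # statistics differ from ghost batch to ghost batch
    normalized = (gb - mean) / (std + 1e-5)          # small-batch statistics re-introduce slight noise
```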