1. Introduction
I. intro.
AI Venn Diagram
AI: Create a machine that can think like humans.
(Contains) Machine learning
(Contains) Representation learning
- (Contains) Deep learning
Comparison
(1) Rule-based systems
- Input $\rightarrow$ Hand-designed program $\rightarrow$ Output
(2) Classic machine learning
Input $\rightarrow$ Hand-designed features $\rightarrow$ Mapping $\rightarrow$ Output
- human selects feature
(3) Representation learning
Input $\rightarrow$ Features $\rightarrow$ Mapping $\rightarrow$ Output
: able to learn from data
$Input \rightarrow Features \rightarrow Mapping \rightarrow Output$
(3.1) Deep learning
Input $\rightarrow$ Simple features $\rightarrow$ additional layers of more abstract features $\rightarrow$ Mapping $\rightarrow$ Output
- $\hookrightarrow$ end-to-end 자동화
Three waves
(1) Cybernetics
Perceptron: $g(x) = Sign(\sum_{i=1}^{d} w_{i}x_{i} + b)$
$\delta$-rule: $w \leftarrow w + \eta \cdot \epsilon \cdot x$
- 오차계산 후 $\epsilon$에 따라 반영. (Reflect based on $\epsilon$ after calculating error)
Stochastic gradient descent
: mini-batch 만보고 경사하강. (Gradient descent looking only at mini-batch)
Randomly descent
(2) Connectionism
Neural nets: hierarchical MLP
Distributed representation
Back propagation
not one hot encoding (e.g., apple = [0.85, 0.2, …])
Can be generalized
: algorithm for MLP $\delta$-rule. (chain rule)
(3) Status quo
Minimize loss function.
Subhuman: general intelligence
Human-level: perception tasks
Superhuman: domains with structured big data
I. Review
Machine learning
Learning from data.
[Flowchart Transcription]
Unknown target distribution: $P(y x)$ (annotated: 정답) Unknown input distribution: $P(x)$ (annotated: input)
- $\downarrow$ generates $x_1, …, x_N$
Training examples: $(x_1, y_1) \dots (x_N, y_N)$
- $\downarrow$
Learning algorithm $\mathcal{A}$
$\uparrow$ takes input from Hypothesis set $\mathcal{H}$
$\rightarrow$ outputs
- Final hypothesis $g$: $g(x) \approx f(x)$
Tasks in ML
Described in terms of how to process an example
Collection of features: $x \in \mathbb{R}^n$
- $x_i$: a feature
Performance measure
Task dependent
- Classification: error rate $E$
Evaluated using datasets
Training: $E_{train}$
Density estimation: avg. log-probability the model assigns to examples
likelihood가 높을 수록 좋은데, (Higher likelihood is better,)
$L(\theta) = \prod_{i=1}^{N} p_{\theta}(x^{(i)})$
$\sum \log L(\theta)$
Dev: $E_{dev}$
Test: $E_{test}$
Challenges
Frequent? Large?
What to measure?
Measurement is hard
Data sets
Test: $E_{test} \approx E_{gen} = E_{out}$ (for generalization)
Val: $E_{val}$ ($E_{dev}$) (for model selection or hyperparameter tuning)
Train: $E_{train}$ (for model fitting)
e.g., k-fold cross validation
Diagram: [Splits data into k parts]
assume: 3개의 data 분포가 동일 (assume: 3 data distributions are identical)
k번 계산량 늘지만, 나누는 것의 penalty 줄고 더 많은 양이 train 된다. (Calculation increases k times, but penalty of splitting decreases and more amount is trained.)
1. Theory of Generalization
Capacity of model
$C =$ model fit various functions
Underfitting: Capacity $\downarrow$
Overfitting: Capacity $\uparrow$
Appropriate
Choosing a model
- $\hookrightarrow$ Parsimony, choose Simplest one!
VC generalization bound
For any $\epsilon > 0$, $N > 0$
$ P[ E_{train}(g) - E_{test}(g) > \epsilon] \le (2N)^{d_{VC}} \cdot e^{-\frac{1}{8}\epsilon^{2}N}$ - $\hookrightarrow$ bad event 확률 (probability of bad event)
Central Challenge
$E_{gen} \approx 0$
- $\downarrow$ (VC bound)
$E_{test} \approx 0$
- $\downarrow$
Underfitting vs Overfitting
Analysis of Error
Generalization gap: $E_{test} \approx E_{train}$
gap이 얼마나 적은가 (How small is the gap)
If not: Overfitting (due to variance)
$\rightarrow$ Regularization, less complex model, more data
Approximation ($E_{train}$)
모델이 충분히 복잡해야한다. (Model must be sufficiently complex.)
If not: Underfitting (due to bias)
$\rightarrow$ Better optimization, more complex model
Augmented Error
$E_{aug}(\theta) = E_{train}(\theta) + \lambda \cdot \Omega(\theta)$
- $\Omega(\theta)$: Regularizer
Choosing a Model (Continued)
Complex model + big data + effective regularizer
$\Rightarrow$ Hinges on double descent phenomena
[Graph: Model Complexity vs Error]
$E_{test}$ (U-shaped curve) - annotated “modeled noise”
$E_{train}$ (Decreasing curve) - annotated “unmodeled data”
X-axis: Model Complexity ($\lambda^{-1}$)
Linear models
Advantages
Simplicity
Generalization: $E_{out} \approx E_{train}$
Extension: Can solve classification, regression, probability estimation
Formalize binary classification
$\mathcal{X} = \mathbb{R}^d$ ($x \in \mathcal{X}: x = (x_1, x_2, \dots, x_d)$ input)
$\mathcal{Y} = {+1, -1}$ output space
Note: $f: \mathcal{X} \rightarrow \mathcal{Y}$ ideal approval formula (target function)
$g: \mathcal{X} \rightarrow \mathcal{Y}$: hypothesis
Decision making
$\sum_{i=1}^{d} w_i x_i > threshold$ (approve)
- (Weighted coordinate)
Perceptron (compact ver.)
$g(x) = Sign[(\sum_{i=1}^{d} w_i x_i) - threshold]$
$= Sign((\sum_{i=1}^{d} w_i x_i) + b)$
$= Sign(w^T x + b)$
where $b$ is called “bias”
Parameters $\theta := (w_1, w_2, \dots, w_d, b)$
- $\hookrightarrow$ yield different hyperplanes
Optimization
- $\nabla E_{train}(w) = 0$
Loss functions
0-1 Loss: 다르면 1 (1 if different)
BCE (Binary Cross Entropy)
y: label (정답)
$\hat{y}$: model output
def: $-y \log \hat{y} - (1-y) \log (1-\hat{y})$
Logistic: $\log(1 + e^{-y\hat{y}})$
Optimization Example (Linear Regression / Least Squares)
Problem: $\min y - (w^T x) ^2$ Logistic regression uses Sigmoid: $G(z) = \frac{1}{1+e^{-z}}$
Solving $\nabla E_{train}(w) = 0$:
$J(w) = \sum_{i=1}^{d} (y_i - w_i x_i)^2$
$\frac{dJ}{dw} = \sum_{i=1}^{d} 2(y_i - w_i x_i) \cdot (-x_i)$
$= -2 \sum_{i=1}^{d} (x_i y_i - w_i x_i^2) = 0$
$\therefore w_i = \frac{\sum x_i y_i}{\sum x_i^2}$
Why BCE?
$H(p, q) = -\sum p_i \log q_i$ (= Cross Entropy)
- $C=\dots$ 예측했을때 P 설명하는 비용 (Cost of explaining P when predicting…)
If $y \in {0, 1}$:
$p \in [y, 1-y]$
$q \in [\hat{y}, 1-\hat{y}]$
$H(p, q) = -y \log \hat{y} - (1-y) \log(1-\hat{y})$
Loss for multiclass
Model $\rightarrow$ Softmax $\rightarrow$ Cross Entropy Loss
$J_{CE}(y, \hat{y}) \triangleq -\sum_{i=1}^{k} y_i \log \hat{y}i = -y{answer} \cdot \log \hat{y}_{answer}$
Activation functions
ReLU
Leaky ReLU
GELU
Elements of neural nets
Basic Unit
$X \rightarrow$ Weighted Sum (Linear) $\rightarrow$ Activation Function (Possibly non-linear) $\rightarrow$ Output
- (Feedback loop marked as “feed back”)
MLP: $w^T X + b$
CNN: $w * X$
Encoder, Decoder
- $x$ $\rightarrow$ Encoder (input summarizer) $\rightarrow$ $h$ (efficient repre) $\rightarrow$ Decoder (output generator) $\rightarrow$ $y$