5. CNN
1. CNN (Convolutional Neural Networks)
CNN
Neural networks that use convolution in their layers.
$\hookrightarrow$ Time-series, image, video.
Designed to automatically learn spatial hierarchies of features.
Motivation
$\hookrightarrow$ Computer vision, but MLPs do not scale.
- Full connectivity $\rightarrow$ # parameters $\uparrow$ $\rightarrow$ generalization $\downarrow$.
CNN: Assumes that the input is an image.
Constrains the network architecture.
$\hookrightarrow$ Can reduce # parameters.
Invariance vs. Equivariance
Invariance: $f(x) = f(s(x))$.
- (Input changes ($s$), output stays same).
Equivariance: $s(f(x)) = f(s(x))$.
- (Input changes ($s$), output changes similarly ($s$)).
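A minimal numpy sketch of the equivariance property for convolution (circular convolution and a circular shift are used so boundary effects don't interfere; the signal and kernel values are arbitrary assumptions):

```python
import numpy as np

x = np.random.randn(8)           # input signal
k = np.array([1., -1., 0.5])     # kernel

def shift(v):                    # s(x): translate by one position (circular)
    return np.roll(v, 1)

def circ_conv(v, k):             # circular convolution via the frequency domain
    n = len(v)
    return np.real(np.fft.ifft(np.fft.fft(v) * np.fft.fft(k, n)))

# Equivariance: f(s(x)) == s(f(x)) -- shifting the input shifts the output identically.
print(np.allclose(circ_conv(shift(x), k), shift(circ_conv(x, k))))   # True
```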
Parameter Sharing
CNN: Share in space.
RNN: Share in time.
Pros: Can reduce # parameters, better generalization.
Cons: Susceptible to gradient problems (vanishing/exploding), since the same weights are reused at every position (or time step).
Inductive Bias
Set of assumptions a model makes about unseen (test) data.
$\hookrightarrow$ The stronger the assumptions, the higher the sample efficiency
(i.e., not much data is needed).
Comparisons:
MLP: The mapping can be a composition of learned functions.
CNN: Local, translation-invariant features.
Attention: Importance can be dynamically calculated.
Layers in CNN
Conv: (Most computation).
Batch Norm.
ReLU.
Pooling.
FC (Fully Connected): (Most parameters).
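A minimal PyTorch sketch of this layer ordering (input size, channel counts, and class count are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Conv (most computation) -> BatchNorm -> ReLU -> Pooling, then FC (most parameters).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),              # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),  # the FC layer holds most of the parameters
)

print(model(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])
```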
2. Convolution Operations
Convolution
$\hookrightarrow$ Moving average with learned weights.
Kernel: Spatial feature detector.
Convolution Layer
$Input \rightarrow Filter \rightarrow Output$
Dimensions: $6 \times 6 \times 3 \xrightarrow{3 \times 3 \times 3} 4 \times 4 \times 20$ (if 20 filters).
# of filters determines the depth of the output volume.
Output Size Formula:
- $Output = 1 + \frac{Input + 2 \times Padding - Kernel}{Stride}$.
Example:
$7 \times 7$ input, Filter $3 \times 3$, Stride=1, Padding=0.
Output size: $1 + \frac{7 + 2 \times 0 - 3}{1} = 5$, i.e., $5 \times 5$.
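A quick sketch of the formula, cross-checked against an actual layer (the $7 \times 7$ / $3 \times 3$ / stride 1 / padding 0 numbers are the example above):

```python
import torch
import torch.nn as nn

def conv_out_size(inp, kernel, stride=1, padding=0):
    # Output = 1 + (Input + 2*Padding - Kernel) / Stride
    return 1 + (inp + 2 * padding - kernel) // stride

print(conv_out_size(7, 3, stride=1, padding=0))   # 5

# Cross-check: 7x7 input, 3x3 kernel, stride 1, padding 0 -> 5x5 output.
y = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)(torch.randn(1, 1, 7, 7))
print(y.shape)                                    # torch.Size([1, 1, 5, 5])
```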
Types of Convolution Operators
(1) Pointwise Convolution ($1 \times 1$)
Linear combination across the depth (channels) at each spatial position.
$1 \times 1 \times C$ filters.
Depth adjustment:
$\hookrightarrow$ Inception model, Residual block.
$56 \times 56 \times 64 \xrightarrow{1 \times 1, 32} 56 \times 56 \times 32$ (Dimension reduction).
Ex. Inception Module
$\hookrightarrow$ Multiple scales in parallel.
Concatenation along the channel axis.
$1 \times 1$ conv used for dimension reduction (bottleneck).
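A minimal sketch of the $1 \times 1$ conv as depth adjuster, using the $56 \times 56 \times 64 \rightarrow 56 \times 56 \times 32$ numbers above, plus a toy two-branch, Inception-style concatenation (branch channel counts are assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                  # 56x56x64 feature map
pointwise = nn.Conv2d(64, 32, kernel_size=1)    # 32 filters of size 1x1x64
print(pointwise(x).shape)                       # torch.Size([1, 32, 56, 56]) -- depth reduced

# Inception-style idea: multiple scales in parallel, 1x1 bottleneck before the 3x3,
# then concatenation along the channel axis.
b1 = nn.Conv2d(64, 16, 1)(x)
b3 = nn.Conv2d(16, 16, 3, padding=1)(nn.Conv2d(64, 16, 1)(x))
print(torch.cat([b1, b3], dim=1).shape)         # torch.Size([1, 32, 56, 56])
```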
(2) Transposed Convolution
$\hookrightarrow$ Decoding layer of a convolutional autoencoder (Opposite of normal conv).
Ex. Classification (Encode) $\rightarrow$ Pixel-wise Decode.
How to upsample?
Rule-based: Nearest, Bed of nails, Max unpooling.
Learnable: Transposed Conv.
Smaller map $\rightarrow$ Larger map.
Not an inverse of convolution (not $Conv^{-1}$).
Idea: the stride is applied in the output map (each input element moves by the stride in the output).
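A short sketch contrasting rule-based upsampling with a learnable transposed convolution (kernel 2, stride 2 chosen for a clean 2× upsample; all sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 7, 7)                                      # smaller map

# Rule-based upsampling: no learned parameters.
print(F.interpolate(x, scale_factor=2, mode="nearest").shape)    # [1, 8, 14, 14]

# Learnable upsampling: transposed conv (smaller map -> larger map).
up = nn.ConvTranspose2d(8, 4, kernel_size=2, stride=2)
print(up(x).shape)                                               # torch.Size([1, 4, 14, 14])
```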
(3) Dilated Convolution
Parameter: Dilation rate.
$\hookrightarrow$ Space between values in a filter.
$3 \times 3$ filter acts like $5 \times 5$ filter (with holes).
$\hookrightarrow$ A wider field of view at the same computational cost.
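A sketch showing that a $3 \times 3$ kernel with dilation rate 2 covers the same $5 \times 5$ field of view with only 9 weights (padding chosen to preserve spatial size; sizes are assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # acts over a 5x5 area
dense   = nn.Conv2d(1, 1, kernel_size=5, padding=2)              # same field of view

print(dilated(x).shape, dense(x).shape)               # both torch.Size([1, 1, 16, 16])
print(dilated.weight.numel(), dense.weight.numel())   # 9 vs. 25 weights
```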
(4) Depthwise Separable Convolution
Goal: Reduce # of parameters and computation.
Steps:
Depthwise conv: Spatial convolution per channel.
Pointwise conv: 1x1 convolution to mix channels.
Comparison:
Regular: $128 \times (3 \times 3 \times 3) \times (5 \times 5)$.
Depthwise Separable: $(3 \times 3 \times 1) \times 3 \times (5 \times 5) + (1 \times 1 \times 3) \times 128 \times (5 \times 5)$.
Generalize: cost ratio $\approx \frac{1}{N} + \frac{1}{K^2}$ for $N$ output channels and a $K \times K$ kernel (effective when $N$ is large).
$\hookrightarrow$ Fewer kernel parameters.
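A small sketch reproducing the parameter comparison above in code (3 input channels, $3 \times 3$ kernels, 128 output channels, as in the text):

```python
import torch.nn as nn

regular = nn.Conv2d(3, 128, kernel_size=3, bias=False)

depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)  # spatial conv per channel
pointwise = nn.Conv2d(3, 128, kernel_size=1, bias=False)          # 1x1 conv mixes channels

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(regular))                        # 128 * (3*3*3) = 3456
print(params(depthwise) + params(pointwise))  # 3*(3*3*1) + 128*(1*1*3) = 27 + 384 = 411
```

Multiplying each count by the $5 \times 5$ output positions gives the operation counts in the comparison above.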
(5) 3D Convolution
(6) Grouped Convolution
- Originally proposed to distribute the model across limited GPU memory (AlexNet).
(7) Shuffled Grouped Convolution
Reflects dependencies between channels.
Problem with grouped conv: no information flow between channel groups.
Solution: shuffle channels across groups.
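A minimal sketch of the channel-shuffle operation (group and channel counts are assumptions): reshape the channels into (groups, channels per group), transpose, and flatten back, so the next grouped conv sees channels from every group.

```python
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    # (N, C, H, W) -> (N, g, C//g, H, W) -> swap group/channel axes -> flatten back
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.arange(6.).view(1, 6, 1, 1)          # channels 0..5, groups [0,1,2 | 3,4,5]
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 3., 1., 4., 2., 5.])
```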
(8) Convolution by Frequency Domain Conversion
$\hookrightarrow$ Sometimes faster.
Convolution theorem: $\mathcal{F}\{w(t) * x(t)\} = W(j\omega)X(j\omega)$, so convolve by transforming, multiplying, then applying $\mathcal{F}^{-1}$.
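A 1-D numpy sketch of the convolution theorem: zero-pad, transform, multiply, inverse-transform, and compare with direct convolution (signal and kernel lengths are assumptions):

```python
import numpy as np

x = np.random.randn(128)            # signal x(t)
w = np.random.randn(5)              # kernel w(t)

direct = np.convolve(x, w)          # time-domain convolution, length 128 + 5 - 1

# Frequency domain: F{w * x} = W(jw) X(jw), then invert.
n = len(x) + len(w) - 1
freq = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(w, n)))

print(np.allclose(direct, freq))    # True
```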
3. Architectures
ResNet
Fit to $F(x) = H(x) - x$, not $H(x)$.
$H(x) = F(x) + x$ (Residual Connection).
$\hookrightarrow$ Nice gradient flow.
- $\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1$ (gradient highway: the $+1$ from the identity path lets gradients flow directly).
Pre-activation ResNet
Improved residual block design.
$ReLU \rightarrow Conv$ (Pre-activation) instead of $Conv \rightarrow ReLU$.
$\hookrightarrow$ Creates a more 'direct' path.
Better for training deeper nets.
Uses Bottleneck layers for efficiency (1x1 conv).
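A minimal sketch of a pre-activation residual block (channel count is an assumption): the skip connection implements $H(x) = F(x) + x$, and BN/ReLU come before each conv rather than after.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: H(x) = F(x) + x."""
    def __init__(self, c):
        super().__init__()
        self.f = nn.Sequential(                       # F(x): BN -> ReLU -> Conv, twice
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1),
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return self.f(x) + x                          # identity skip = gradient highway

print(PreActResidualBlock(64)(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```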
DenseNet
Densely connected CNN.
Dense blocks: each layer receives the feature maps of all preceding layers as input.
$\hookrightarrow$ Alleviates vanishing gradients.
$\hookrightarrow$ Strengthens feature propagation.
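A minimal sketch of a dense block (growth rate and layer count are arbitrary assumptions): each layer takes the concatenation of all earlier feature maps as input.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=12, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # each layer sees all earlier maps
        return torch.cat(feats, dim=1)

print(DenseBlock(16)(torch.randn(1, 16, 8, 8)).shape)     # torch.Size([1, 52, 8, 8])
```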
Squeeze-and-Excitation Networks (SENet)
Base: ResNeXt.
SE Block:
Squeeze: Global information embedding.
- Global Average Pooling (GAP): $H \times W \times C \rightarrow 1 \times 1 \times C$.
Excitation: Adaptive recalibration.
$FC \rightarrow ReLU \rightarrow FC \rightarrow Sigmoid$.
Compress $\rightarrow$ Decompress.
Main Idea: Improve representational power of a network by modeling interdependencies between channels of conv features.
Reweight: $F_{scale} = U \otimes \sigma(\hat{U}) = \tilde{X}$ (channel-wise reweighting, i.e., 'coloring' the feature maps).
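A minimal sketch of an SE block (the reduction ratio $r = 16$ and the channel count are assumptions): squeeze with GAP, excite with FC → ReLU → FC → Sigmoid, then reweight the channels of $U$.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # GAP: H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(),              # compress
            nn.Linear(c // r, c), nn.Sigmoid(),           # decompress -> per-channel weights
        )

    def forward(self, u):
        s = self.excite(self.squeeze(u).flatten(1))       # (N, C) channel descriptors
        return u * s.view(u.size(0), -1, 1, 1)            # reweight ("color") the feature maps

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)      # torch.Size([2, 64, 32, 32])
```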
U-Net
$\hookrightarrow$ Encoder-Decoder arch. with Skip connection.
Structure: (Contracting path) + (Expanding path).
Capture context + Enable precise location.
Fine-grained positional information (carried across by the skip connections).
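A toy, one-level sketch of the U-Net idea (channel counts are assumptions): a contracting path, an expanding path, and a skip connection that carries fine-grained positional information to the decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                            # contracting path
        self.mid  = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(32, 16, 2, stride=2)    # expanding path
        self.dec  = nn.Conv2d(32, 1, 3, padding=1)             # 32 = 16 (upsampled) + 16 (skip)

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.mid(self.down(e)))
        return self.dec(torch.cat([d, e], dim=1))              # skip connection: concatenate

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)             # torch.Size([1, 1, 64, 64])
```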
MobileNet
$\hookrightarrow$ Efficient CNN for mobile.
Uses Depthwise Separable Conv.
$\hookrightarrow$ More efficient than standard convolutions.
Previous: Reduce # of parameters (e.g., SqueezeNet).
Now: Reduce # of operations (for performance).
EfficientNet
CNN: Fixed resource budget $\rightarrow$ Scale up for better accuracy.
EfficientNet: The authors systematically study model scaling.
- Balancing Depth, Width, Resolution.
Compound Scaling:
$d = \alpha^\phi$
$w = \beta^\phi$
$r = \gamma^\phi$
s.t. $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$.
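A tiny worked example of compound scaling; $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the grid-searched values reported in the EfficientNet paper, and the loop just shows how the multipliers grow with $\phi$.

```python
# d = alpha^phi, w = beta^phi, r = gamma^phi, s.t. alpha * beta^2 * gamma^2 ~= 2,
# so total FLOPs grow roughly by 2^phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

print(alpha * beta**2 * gamma**2)   # ~1.92, close to 2
```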