5. CNN
1. CNN (Convolutional Neural Networks)
CNN
Neural networks that use convolution in their layers.
$\hookrightarrow$ Time-series, image, video.
Designed to automatically learn spatial hierarchies of features.
Motivation
$\hookrightarrow$ Computer vision, but MLPs do not scale.
- Full connectivity $\rightarrow$ # parameters $\uparrow$ $\rightarrow$ generalization $\downarrow$.
CNN: Assumes that the input is an image.
Constrains the network architecture.
$\hookrightarrow$ Can reduce # parameters.
Invariance vs. Equivariance
Invariance: $f(x) = f(s(x))$.
- (Input changes ($s$), output stays same).
Equivariance: $s(f(x)) = f(s(x))$.
- (Input changes ($s$), output changes similarly ($s$)).
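A minimal numpy sketch of the equivariance property for convolution (circular convolution and a circular shift are used so boundary effects don't interfere; the signal and kernel values are arbitrary assumptions):

```python
import numpy as np

x = np.random.randn(8)           # input signal
k = np.array([1., -1., 0.5])     # kernel

def shift(v):                    # s(x): translate by one position (circular)
    return np.roll(v, 1)

def circ_conv(v, k):             # circular convolution via the frequency domain
    n = len(v)
    return np.real(np.fft.ifft(np.fft.fft(v) * np.fft.fft(k, n)))

# Equivariance: f(s(x)) == s(f(x)) -- shifting the input shifts the output identically.
print(np.allclose(circ_conv(shift(x), k), shift(circ_conv(x, k))))   # True
```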
Parameter Sharing
CNN: Share in space.
RNN: Share in time.
Pros: Can reduce # parameters, better generalization.
Cons: Susceptible to gradient problems (vanishing/exploding), since the same weights are reused at every position (or time step).
Inductive Bias
Set of assumptions a model makes about unseen (test) data.
$\hookrightarrow$ The stronger the assumptions, the higher the sample efficiency
(i.e., not much data is needed).
Comparisons:
MLP: The mapping can be a composition of learned functions.
CNN: Local, translation-invariant features.
Attention: Importance can be dynamically calculated.
Layers in CNN
Conv: (Most computation).
Batch Norm.
ReLU.
Pooling.
FC (Fully Connected): (Most parameters).
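A minimal PyTorch sketch of this layer ordering (input size, channel counts, and class count are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Conv (most computation) -> BatchNorm -> ReLU -> Pooling, then FC (most parameters).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),              # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),  # the FC layer holds most of the parameters
)

print(model(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])
```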
2. Convolution Operations
Convolution
$\hookrightarrow$ Moving average with learned weights.
Kernel: Spatial feature detector.
Convolution Layer
$Input \rightarrow Filter \rightarrow Output$
Dimensions: $6 \times 6 \times 3 \xrightarrow{3 \times 3 \times 3} 4 \times 4 \times 20$ (if 20 filters).
# of filters determines the depth of the output volume.
Output Size Formula:
- $Output = 1 + \frac{Input + 2 \times Padding - Kernel}{Stride}$.
Example:
$7 \times 7$ input, Filter $3 \times 3$, Stride=1, Padding=0.
Output size: $1 + \frac{7 + 2 \times 0 - 3}{1} = 5$, i.e., $5 \times 5$.
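A quick sketch of the formula, cross-checked against an actual layer (the $7 \times 7$ / $3 \times 3$ / stride 1 / padding 0 numbers are the example above):

```python
import torch
import torch.nn as nn

def conv_out_size(inp, kernel, stride=1, padding=0):
    # Output = 1 + (Input + 2*Padding - Kernel) / Stride
    return 1 + (inp + 2 * padding - kernel) // stride

print(conv_out_size(7, 3, stride=1, padding=0))   # 5

# Cross-check: 7x7 input, 3x3 kernel, stride 1, padding 0 -> 5x5 output.
y = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)(torch.randn(1, 1, 7, 7))
print(y.shape)                                    # torch.Size([1, 1, 5, 5])
```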
Types of Convolution Operators
(1) Pointwise Convolution ($1 \times 1$)
Linear combination across the depth (channels) at each spatial position.
$1 \times 1 \times C$ filters.
Depth adjustment:
$\hookrightarrow$ Inception model, Residual block.
$56 \times 56 \times 64 \xrightarrow{1 \times 1, 32} 56 \times 56 \times 32$ (Dimension reduction).
Ex. Inception Module
$\hookrightarrow$ Multiple scales in parallel.
Concatenation along the channel axis.
$1 \times 1$ conv used for dimension reduction (bottleneck).
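A minimal sketch of the $1 \times 1$ conv as depth adjuster, using the $56 \times 56 \times 64 \rightarrow 56 \times 56 \times 32$ numbers above, plus a toy two-branch, Inception-style concatenation (branch channel counts are assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                  # 56x56x64 feature map
pointwise = nn.Conv2d(64, 32, kernel_size=1)    # 32 filters of size 1x1x64
print(pointwise(x).shape)                       # torch.Size([1, 32, 56, 56]) -- depth reduced

# Inception-style idea: multiple scales in parallel, 1x1 bottleneck before the 3x3,
# then concatenation along the channel axis.
b1 = nn.Conv2d(64, 16, 1)(x)
b3 = nn.Conv2d(16, 16, 3, padding=1)(nn.Conv2d(64, 16, 1)(x))
print(torch.cat([b1, b3], dim=1).shape)         # torch.Size([1, 32, 56, 56])
```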
(2) Transposed Convolution
$\hookrightarrow$ Decoding layer of a convolutional autoencoder (Opposite of normal conv).
Ex. Classification (Encode) $\rightarrow$ Pixel-wise Decode.
How to upsample?
Rule-based: Nearest, Bed of nails, Max unpooling.
Learnable: Transposed Conv.
Smaller map $\rightarrow$ Larger map.
Not an inverse of convolution (not $Conv^{-1}$).
Idea: the stride is applied in the output map (each input element moves by the stride in the output).
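A short sketch contrasting rule-based upsampling with a learnable transposed convolution (kernel 2, stride 2 chosen for a clean 2× upsample; all sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 7, 7)                                      # smaller map

# Rule-based upsampling: no learned parameters.
print(F.interpolate(x, scale_factor=2, mode="nearest").shape)    # [1, 8, 14, 14]

# Learnable upsampling: transposed conv (smaller map -> larger map).
up = nn.ConvTranspose2d(8, 4, kernel_size=2, stride=2)
print(up(x).shape)                                               # torch.Size([1, 4, 14, 14])
```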
(3) Dilated Convolution
Parameter: Dilation rate.
$\hookrightarrow$ Space between values in a filter.
$3 \times 3$ filter acts like $5 \times 5$ filter (with holes).
$\hookrightarrow$ A wider field of view at the same computational cost.
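A sketch showing that a $3 \times 3$ kernel with dilation rate 2 covers the same $5 \times 5$ field of view with only 9 weights (padding chosen to preserve spatial size; sizes are assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # acts over a 5x5 area
dense   = nn.Conv2d(1, 1, kernel_size=5, padding=2)              # same field of view

print(dilated(x).shape, dense(x).shape)               # both torch.Size([1, 1, 16, 16])
print(dilated.weight.numel(), dense.weight.numel())   # 9 vs. 25 weights
```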
(4) Depthwise Separable Convolution
Goal: Reduce # of parameters and computation.
Steps:
Depthwise conv: Spatial convolution per channel.
Pointwise conv: 1x1 convolution to mix channels.
Comparison:
Regular: $128 \times (3 \times 3 \times 3) \times (5 \times 5)$.
Depthwise Separable: $(3 \times 3 \times 1) \times 3 \times (5 \times 5) + (1 \times 1 \times 3) \times 128 \times (5 \times 5)$.
Generalize: cost ratio $\approx \frac{1}{N} + \frac{1}{K^2}$ for $N$ output channels and a $K \times K$ kernel (effective when $N$ is large).
$\hookrightarrow$ Fewer kernel parameters.
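A small sketch reproducing the parameter comparison above in code (3 input channels, $3 \times 3$ kernels, 128 output channels, as in the text):

```python
import torch.nn as nn

regular = nn.Conv2d(3, 128, kernel_size=3, bias=False)

depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)  # spatial conv per channel
pointwise = nn.Conv2d(3, 128, kernel_size=1, bias=False)          # 1x1 conv mixes channels

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(regular))                        # 128 * (3*3*3) = 3456
print(params(depthwise) + params(pointwise))  # 3*(3*3*1) + 128*(1*1*3) = 27 + 384 = 411
```

Multiplying each count by the $5 \times 5$ output positions gives the operation counts in the comparison above.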
(5) 3D Convolution
(6) Grouped Convolution
- Originally proposed to distribute the model across limited GPU memory (AlexNet).
(7) Shuffled Grouped Convolution
Reflects dependencies between channels.
Problem with grouped conv: no information flow between channel groups.
Solution: shuffle channels across groups.
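A minimal sketch of the channel-shuffle operation (group and channel counts are assumptions): reshape the channels into (groups, channels per group), transpose, and flatten back, so the next grouped conv sees channels from every group.

```python
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    # (N, C, H, W) -> (N, g, C//g, H, W) -> swap group/channel axes -> flatten back
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.arange(6.).view(1, 6, 1, 1)          # channels 0..5, groups [0,1,2 | 3,4,5]
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 3., 1., 4., 2., 5.])
```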
(8) Convolution by Frequency Domain Conversion
$\hookrightarrow$ Sometimes faster.
Convolution theorem: $\mathcal{F}\{w(t) * x(t)\} = W(j\omega)X(j\omega)$, so convolve by transforming, multiplying, then applying $\mathcal{F}^{-1}$.
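A 1-D numpy sketch of the convolution theorem: zero-pad, transform, multiply, inverse-transform, and compare with direct convolution (signal and kernel lengths are assumptions):

```python
import numpy as np

x = np.random.randn(128)            # signal x(t)
w = np.random.randn(5)              # kernel w(t)

direct = np.convolve(x, w)          # time-domain convolution, length 128 + 5 - 1

# Frequency domain: F{w * x} = W(jw) X(jw), then invert.
n = len(x) + len(w) - 1
freq = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(w, n)))

print(np.allclose(direct, freq))    # True
```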
3. Architectures
ResNet
Fit to $F(x) = H(x) - x$, not $H(x)$.
$H(x) = F(x) + x$ (Residual Connection).
$\hookrightarrow$ Nice gradient flow.
- $\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1$ (gradient highway: the $+1$ from the identity path lets gradients flow directly).
Pre-activation ResNet
Improved residual block design.
$ReLU \rightarrow Conv$ (Pre-activation) instead of $Conv \rightarrow ReLU$.
$\hookrightarrow$ Creates a more 'direct' path.
Better for training deeper nets.
Uses Bottleneck layers for efficiency (1x1 conv).
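A minimal sketch of a pre-activation residual block (channel count is an assumption): the skip connection implements $H(x) = F(x) + x$, and BN/ReLU come before each conv rather than after.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: H(x) = F(x) + x."""
    def __init__(self, c):
        super().__init__()
        self.f = nn.Sequential(                       # F(x): BN -> ReLU -> Conv, twice
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1),
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return self.f(x) + x                          # identity skip = gradient highway

print(PreActResidualBlock(64)(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```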
DenseNet
Densely connected CNN.
Dense blocks: each layer receives the feature maps of all preceding layers as input.
$\hookrightarrow$ Alleviates vanishing gradients.
$\hookrightarrow$ Strengthens feature propagation.
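A minimal sketch of a dense block (growth rate and layer count are arbitrary assumptions): each layer takes the concatenation of all earlier feature maps as input.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=12, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # each layer sees all earlier maps
        return torch.cat(feats, dim=1)

print(DenseBlock(16)(torch.randn(1, 16, 8, 8)).shape)     # torch.Size([1, 52, 8, 8])
```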
Squeeze-and-Excitation Networks (SENet)
Base: ResNeXt.
SE Block:
Squeeze: Global information embedding.
- Global Average Pooling (GAP): $H \times W \times C \rightarrow 1 \times 1 \times C$.
Excitation: Adaptive recalibration.
$FC \rightarrow ReLU \rightarrow FC \rightarrow Sigmoid$.
Compress $\rightarrow$ Decompress.
Main Idea: Improve representational power of a network by modeling interdependencies between channels of conv features.
Reweight: $F_{scale} = U \otimes \sigma(\hat{U}) = \tilde{X}$ (channel-wise reweighting, i.e., 'coloring' the feature maps).
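A minimal sketch of an SE block (the reduction ratio $r = 16$ and the channel count are assumptions): squeeze with GAP, excite with FC → ReLU → FC → Sigmoid, then reweight the channels of $U$.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # GAP: H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(),              # compress
            nn.Linear(c // r, c), nn.Sigmoid(),           # decompress -> per-channel weights
        )

    def forward(self, u):
        s = self.excite(self.squeeze(u).flatten(1))       # (N, C) channel descriptors
        return u * s.view(u.size(0), -1, 1, 1)            # reweight ("color") the feature maps

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)      # torch.Size([2, 64, 32, 32])
```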
U-Net
$\hookrightarrow$ Encoder-Decoder arch. with Skip connection.
Structure: (Contracting path) + (Expanding path).
Capture context + Enable precise location.
Fine-grained positional information (carried across by the skip connections).
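A toy, one-level sketch of the U-Net idea (channel counts are assumptions): a contracting path, an expanding path, and a skip connection that carries fine-grained positional information to the decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                            # contracting path
        self.mid  = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(32, 16, 2, stride=2)    # expanding path
        self.dec  = nn.Conv2d(32, 1, 3, padding=1)             # 32 = 16 (upsampled) + 16 (skip)

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.mid(self.down(e)))
        return self.dec(torch.cat([d, e], dim=1))              # skip connection: concatenate

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)             # torch.Size([1, 1, 64, 64])
```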
MobileNet
$\hookrightarrow$ Efficient CNN for mobile.
Uses Depthwise Separable Conv.
$\hookrightarrow$ More efficient than standard convolutions.
Previous: Reduce # of parameters (e.g., SqueezeNet).
Now: Reduce # of operations (for performance).
EfficientNet
CNN: Fixed resource budget $\rightarrow$ Scale up for better accuracy.
EfficientNet: The authors systematically study model scaling.
- Balancing Depth, Width, Resolution.
Compound Scaling:
$d = \alpha^\phi$
$w = \beta^\phi$
$r = \gamma^\phi$
s.t. $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$.
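A tiny worked example of compound scaling; $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the grid-searched values reported in the EfficientNet paper, and the loop just shows how the multipliers grow with $\phi$.

```python
# d = alpha^phi, w = beta^phi, r = gamma^phi, s.t. alpha * beta^2 * gamma^2 ~= 2,
# so total FLOPs grow roughly by 2^phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

print(alpha * beta**2 * gamma**2)   # ~1.92, close to 2
```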