
5. CNN

1. CNN (Convolutional Neural Networks)

  • CNN

    • Neural networks that use convolution in their layers.

    • $\hookrightarrow$ Time-series, image, video.

    • Designed to automatically learn spatial hierarchies of features.

  • Motivation

    • $\hookrightarrow$ CV, but MLPs do not scale.

    • Full connectivity $\rightarrow$ # parameters $\uparrow$ $\rightarrow$ generalization $\downarrow$.

    • CNN: Assumes that the input is an image.

    • Constrain the network architecture.

    • $\hookrightarrow$ Can reduce # parameters.

  • Invariance vs. Equivariance

    1. Invariance: $f(x) = f(s(x))$.

      • (Input changes ($s$), output stays same).
    2. Equivariance: $s(f(x)) = f(s(x))$.

      • (Input changes ($s$), output changes similarly ($s$)).
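
A minimal NumPy sketch of the two properties (the 1-D signal, the smoothing kernel, and the circular shift below are illustrative assumptions): convolution is translation-equivariant, while a global max-pool on top of it is translation-invariant.

```python
import numpy as np

def conv1d(x, k):
    """'Valid' cross-correlation of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
k = np.array([1., 2., 1.])                 # simple smoothing kernel
shift = lambda v, s: np.roll(v, s)         # s(x): circular shift as the transformation

y = conv1d(x, k)
y_shifted_input = conv1d(shift(x, 2), k)

# Equivariance: shifting the input shifts the feature map by the same amount
# (compare away from the wrap-around border).
print(np.allclose(shift(y, 2)[2:], y_shifted_input[2:]))   # True

# Invariance: a global max-pool over the conv output ignores the shift.
print(y.max() == y_shifted_input.max())                     # True
```
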
  • Parameter Sharing

    • CNN: Share in space.

    • RNN: Share in time.

    • Pros: Can reduce # parameters, better generalization.

    • Cons: Susceptible to gradient problems (vanishing/exploding), since the same shared weight is applied repeatedly.

  • Inductive Bias

    • $\hookrightarrow$ If the assumptions are strong, sample efficiency is high.

    • (Doesn't need as much data.)

    • The set of assumptions a model makes about unseen (test) data.

    • Comparisons:

      • MLP: The mapping can be a composition of learned functions.

      • CNN: Local, translation-invariant features.

      • Attention: Importance can be dynamically calculated.

  • Layers in CNN

    1. Conv: (Most computation).

    2. Batch Norm.

    3. ReLU.

    4. Pooling.

    5. FC (Fully Connected): (Most parameters).
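
As a rough illustration of this layer ordering, here is a minimal PyTorch sketch; the layer sizes, the 32x32 input, and the class name are arbitrary assumptions, not a reference architecture.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # Conv: most computation
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # FC: most parameters

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

out = TinyCNN()(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 10])
```
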


2. Convolution Operations

  • Convolution

    • $\hookrightarrow$ Moving average with learned weights.

    • Kernel: Spatial feature detector.

  • Convolution Layer

    • $Input \rightarrow Filter \rightarrow Output$

    • Dimensions: $6 \times 6 \times 3 \xrightarrow{3 \times 3 \times 3} 4 \times 4 \times 20$ (if 20 filters).

    • # of filters determines the depth of the output volume.

    • Output Size Formula:

      • $Output = 1 + \frac{Input + 2 \times Padding - Kernel}{Stride}$.
    • Example:

      • $7 \times 7$ input, Filter $3 \times 3$, Stride=1, Padding=0.

      • Output size: $1 + \frac{7+0-3}{1} = 5$
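
The same formula can be checked in a few lines of Python; the helper name `conv_output_size` is made up for illustration.

```python
# Evaluates Output = 1 + (Input + 2*Padding - Kernel) / Stride (integer division).
def conv_output_size(input_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    return 1 + (input_size + 2 * padding - kernel) // stride

print(conv_output_size(7, 3, stride=1, padding=0))     # 5, matching the example above
print(conv_output_size(6, 3))                          # 4, matching the 6x6 -> 4x4 example
print(conv_output_size(224, 7, stride=2, padding=3))   # 112
```
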

  • Types of Convolution Operators

    (1) Pointwise Convolution ($1 \times 1$)

    • A linear combination across the depth (channel) dimension.

    • $1 \times 1 \times C$ filters.

    • Depth adjustment:

      • $\hookrightarrow$ Inception model, Residual block.

      • $56 \times 56 \times 64 \xrightarrow{1 \times 1, 32} 56 \times 56 \times 32$ (Dimension reduction).

    Ex. Inception Module

    • $\hookrightarrow$ Multiple scales (filter sizes) in parallel.

    • Concatenation.

    • 1x1 conv used for Dimension Reduction (Bottleneck).
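
A PyTorch sketch of pointwise convolution as depth adjustment and as an Inception-style bottleneck; the channel counts mirror the 56x56x64 example above, and the branch layout is illustrative rather than the exact Inception module.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                    # NCHW
reduce = nn.Conv2d(64, 32, kernel_size=1)         # 32 filters of size 1x1x64
print(reduce(x).shape)                            # torch.Size([1, 32, 56, 56])

# Bottleneck: the 1x1 conv shrinks the depth before the expensive 3x3 conv.
branch = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=1),             # dimension reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 3x3 conv on fewer channels
)
print(branch(x).shape)                            # torch.Size([1, 64, 56, 56])
```
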

    (2) Transposed Convolution

    • $\hookrightarrow$ Decoding layer of a convolutional autoencoder (Opposite of normal conv).

    • Ex. Classification (Encode) $\rightarrow$ Pixel-wise Decode.

    • How to upsample?

      1. Rule-based: Nearest, Bed of nails, Max unpooling.

      2. Learnable: Transposed Conv.

        • Smaller map $\rightarrow$ Larger map.

        • Not an inverse of convolution (not $Conv^{-1}$).

        • Idea: stride over the output; each input value scales the kernel, which is added into the larger output map.
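
A short PyTorch sketch of learnable upsampling with a transposed convolution, next to a rule-based alternative; the feature-map sizes are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)                              # small encoder feature map

# Learnable upsampling: kernel 2, stride 2 doubles the spatial size.
up = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2)
print(up(x).shape)                                        # torch.Size([1, 8, 16, 16])

# Rule-based upsampling for comparison (no learned parameters).
print(nn.Upsample(scale_factor=2, mode='nearest')(x).shape)  # torch.Size([1, 16, 16, 16])
```
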

    (3) Dilated Convolution

    • Parameter: Dilation rate.

    • $\hookrightarrow$ Space between values in a filter.

    • $3 \times 3$ filter acts like $5 \times 5$ filter (with holes).

    • $\hookrightarrow$ A wider field of view at the same computational cost.
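
A quick PyTorch check of the "filter with holes" effect (sizes are illustrative): with dilation 2, a 3x3 kernel covers a 5x5 region, so the unpadded output shrinks accordingly while the kernel still has only 9 weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)
regular = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

print(regular(x).shape)   # torch.Size([1, 1, 5, 5])  effective 3x3 receptive field
print(dilated(x).shape)   # torch.Size([1, 1, 3, 3])  effective 5x5, same 9 weights
```
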

    (4) Depthwise Separable Convolution

    • Goal: Reduce # of parameters and computation.

    • Steps:

      1. Depthwise conv: Spatial convolution per channel.

      2. Pointwise conv: 1x1 convolution to mix channels.

    • Comparison:

      • Regular: $128 \times (3 \times 3 \times 3) \times (5 \times 5)$.

      • Depthwise Separable: $(3 \times 3 \times 1) \times 3 \times (5 \times 5) + (1 \times 1 \times 3) \times 128 \times (5 \times 5)$.

      • Generalize: cost ratio (separable / regular) $= \frac{1}{N} + \frac{1}{K^2}$, i.e., cheaper whenever $K^2 > \frac{N}{N-1}$ (effective when the number of filters $N$ is large).

      • $\hookrightarrow$ Fewer kernel parameters.
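
A PyTorch sketch of the two steps, using the same illustrative sizes as the comparison above (3 input channels, 128 output filters, 3x3 kernels); the class name is made up.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 128, k: int = 3):
        super().__init__()
        # 1) depthwise: one KxK spatial filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=k, groups=in_ch)
        # 2) pointwise: 1x1 conv mixes the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 7, 7)
print(DepthwiseSeparable()(x).shape)   # torch.Size([1, 128, 5, 5])

# Kernel parameters (ignoring biases): 3*3*3 + 3*128 = 411,
# versus 128 * (3*3*3) = 3456 for a regular convolution with the same output depth.
weights = sum(p.numel() for p in DepthwiseSeparable().parameters() if p.dim() > 1)
print(weights)   # 411
```
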

    (5) 3D Convolution

    (6) Grouped Convolution

    • Originally proposed (in AlexNet) to distribute the model across GPUs under memory constraints.

    (7) Shuffled Grouped Convolution

    • Reflects dependencies between channels across groups.

    • Problem with grouped conv: no information flow between channel groups.

    • Solution: Shuffle channel.
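
A minimal sketch of the channel-shuffle operation itself (reshape, transpose, flatten), assuming ShuffleNet-style grouping; the toy tensor just makes the interleaving visible.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave channels across groups
    return x.view(n, c, h, w)

x = torch.arange(6.).view(1, 6, 1, 1)          # channels 0..5, two groups of three
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 3., 1., 4., 2., 5.])
```
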

    (8) Convolution by Frequency Domain Conversion

    • $\hookrightarrow$ Sometimes faster.

    • Fourier Transform (convolution theorem): $\mathcal{F}\{w(t) * x(t)\} = W(j\omega)X(j\omega)$; multiply in the frequency domain, then apply $\mathcal{F}^{-1}$.
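
A NumPy sketch of the convolution theorem in action (signal and kernel lengths are arbitrary): zero-pad to the full output length, multiply the spectra, and invert.

```python
import numpy as np

x = np.random.randn(256)
w = np.random.randn(31)

direct = np.convolve(x, w)                       # time/spatial-domain convolution

n = len(x) + len(w) - 1                          # full linear-convolution length
freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

print(np.allclose(direct, freq))                 # True (up to floating-point error)
```
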


3. Architectures

ResNet

  • Fit to $F(x) = H(x) - x$, not $H(x)$.

  • $H(x) = F(x) + x$ (Residual Connection).

  • $\hookrightarrow$ Nice gradient flow.

    • $\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1$ (Gradient highway).
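
A minimal PyTorch sketch of a basic residual block, assuming the input and output shapes match (no projection shortcut).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # The stacked layers learn F(x); the skip connection adds x back.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)          # torch.Size([1, 64, 32, 32])
```
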

Pre-activation ResNet

  • Improved residual block design.

  • $BN \rightarrow ReLU \rightarrow Conv$ (pre-activation) instead of $Conv \rightarrow BN \rightarrow ReLU$.

  • $\hookrightarrow$ Creates more ‘direct path’.

  • Better for training deeper nets.

  • Uses Bottleneck layers for efficiency (1x1 conv).

DenseNet

  • Densely connected CNN.

  • Dense blocks: Each layer receives the feature maps of all preceding layers (via concatenation) and passes its own to all subsequent layers.

  • $\hookrightarrow$ Alleviates vanishing gradient.

  • $\hookrightarrow$ Strengthens feature propagation and encourages feature reuse.
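
A hedged PyTorch sketch of the dense-connectivity pattern; the growth rate, layer count, and sizes are illustrative, and transition layers are omitted.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int = 12, num_layers: int = 3):
        super().__init__()
        # Layer i sees in_ch + i*growth channels: the concat of all earlier maps.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)        # all feature maps flow forward

x = torch.randn(1, 16, 8, 8)
print(DenseBlock(16)(x).shape)                   # torch.Size([1, 52, 8, 8]) = 16 + 3*12
```
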

Squeeze-and-Excitation Networks (SENet)

  • Base: ResNeXt (SE blocks can be attached to any backbone).

  • SE Block:

    1. Squeeze: Global information embedding.

      • Global Average Pooling (GAP): $H \times W \times C \rightarrow 1 \times 1 \times C$.
    2. Excitation: Adaptive recalibration.

      • $FC \rightarrow ReLU \rightarrow FC \rightarrow Sigmoid$.

      • Compress $\rightarrow$ Decompress.

  • Main Idea: Improve representational power of a network by modeling interdependencies between channels of conv features.

  • Reweight: $\tilde{X} = F_{scale}(U, s) = U \otimes s$ with $s = \sigma(\hat{U})$, i.e., channel-wise reweighting ("coloring") of the feature maps.
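
A compact PyTorch sketch of an SE block; the reduction ratio r=16 follows the paper, while the module name and shapes are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # compress
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # decompress
            nn.Sigmoid(),
        )

    def forward(self, u):
        s = u.mean(dim=(2, 3))                   # squeeze: GAP, H x W x C -> C
        s = self.fc(s)                           # excitation: per-channel weights in (0, 1)
        return u * s[:, :, None, None]           # scale: channel-wise reweighting

u = torch.randn(1, 64, 56, 56)
print(SEBlock(64)(u).shape)                      # torch.Size([1, 64, 56, 56])
```
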

U-Net

  • $\hookrightarrow$ Encoder-Decoder arch. with Skip connection.

  • Structure: (Contracting path) + (Expanding path).

  • Capture context + Enable precise location.

  • Fine-grained positional (localization) information.

MobileNet

  • $\hookrightarrow$ Efficient CNN for mobile.

  • Uses Depthwise Separable Conv.

  • More efficient.

    • Previous: Reduce # of parameters (e.g., SqueezeNet).

    • Now: Reduce # of operations (for performance).

EfficientNet

  • Conventional CNNs: developed under a fixed resource budget, then scaled up for better accuracy when more resources are available.

  • EfficientNet: Authors systematically study model scaling.

    • Balancing Depth, Width, Resolution.
  • Compound Scaling:

    • $d = \alpha^\phi$

    • $w = \beta^\phi$

    • $r = \gamma^\phi$

    • s.t. $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$.
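
A few lines of Python showing the arithmetic, assuming the coefficients reported for the EfficientNet-B0 grid search ($\alpha$=1.2, $\beta$=1.1, $\gamma$=1.15).

```python
alpha, beta, gamma = 1.2, 1.1, 1.15
print(alpha * beta**2 * gamma**2)    # ~1.92, satisfying alpha * beta^2 * gamma^2 ≈ 2

phi = 3                              # compound coefficient chosen by the user
d, w, r = alpha**phi, beta**phi, gamma**phi
print(d, w, r)                       # depth x1.73, width x1.33, resolution x1.52
```
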
