8. Transformer Basics
1. Introduction
RNN and LSTM era
A fixed-size context vector is forced to summarize long sequences.
Vanishing/exploding gradients.
Attention
- $\hookrightarrow$ Selectively focus on relevant parts.
Introducing the Transformer
$\hookrightarrow$ “Attention is all you need”.
Enables massive parallelism (in training).
$O(1)$ path length between any two positions (vs. $O(n)$ in an RNN).
Transformer: A Seq-to-Seq model
Encoder $\rightarrow$ Decoder.
[Stack of Encoders] $\rightarrow$ [Stack of Decoders].
Input (I am a student) $\rightarrow$ Output (Je suis étudiant).
2. Encoder
All encoders are identical in structure (but do not share weights).
Structure:
Self-Attention: Helps the encoder look at other words in the input sentence as it encodes a specific word. (Short-term memory).
Feed Forward Neural Network: Non-linear. The exact same feed-forward network is applied independently to each position. (Long-term memory).
Preprocessing
Tokenization: Text $\rightarrow$ Token.
Word Embedding: Token $\rightarrow$ Vector.
Uses a real-valued dense vector to represent each word.
(Learned separately from Transformer).
Previously, representations were count-based and therefore too sparse.
Allows words with similar meanings to have similar representations.
(e.g., King $-$ Man $+$ Woman $\approx$ Queen).
Word embeddings feed only the first encoder; each later encoder takes the previous encoder's output.
Flow:
- Input $\rightarrow$ Word Embedding $\rightarrow$ [Self-Attention $\rightarrow$ Z (Contextual part) $\rightarrow$ FFNN] $\rightarrow$ Output.
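A minimal NumPy sketch of the first two steps of this flow (toy vocabulary, random stand-in for the learned embedding table; real systems use subword tokenizers and trained embeddings):

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers are subword-based (BPE etc.).
vocab = {"i": 0, "am": 1, "a": 2, "student": 3}
d_model = 8

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # stand-in for learned embeddings

tokens = "I am a student".lower().split()   # tokenization (word-level for simplicity)
ids = [vocab[t] for t in tokens]            # token -> integer id
X = embedding_table[ids]                    # id -> dense vector; shape (4, d_model)
```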
3. Contextualizing with Self-Attention
Goal: produce a contextualized vector for each word.
Self-Attention:
“The animal didn’t cross the street because it was too tired.”
$\hookrightarrow$ “it” is associated with “animal”.
Process:
Create 3 vectors: Query ($q$), Key ($k$), Value ($v$).
- Multiply input embedding $X$ by weight matrices $W^Q, W^K, W^V$.
Calculate Score:
$q \cdot k$ (Dot product).
The score determines how much focus to place on each of the other words.
Divide by $\sqrt{d_k}$:
- Leads to more stable gradients (the scaling step).
Softmax:
- Normalize scores to probability distribution.
Multiply by Value ($v$):
- Keeps the values of focused words intact and drowns out irrelevant words.
Sum:
- $z = \sum_j \mathrm{softmax}(\text{score}_j)\, v_j$.
Matrix Calculation of Self-Attention
- $Z = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$
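A minimal NumPy sketch of this matrix computation (shapes, seeds, and weights are arbitrary stand-ins for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Z = softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project input into query/key/value spaces
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq, seq) scaled dot-product scores
    weights = softmax(scores)                 # each row is a probability distribution
    return weights @ V                        # weighted sum of values -> contextual Z

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)                         # shape (4, 8)
```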
4. Multi-Head Attention
Motivation:
Expand model’s ability to focus on different positions.
Give attention layer multiple “representation subspaces”.
Mechanism:
Have multiple sets of Query/Key/Value weight matrices ($W_i^Q, W_i^K, W_i^V$ for heads $i = 0, \dots, 7$).
Calculate $Z_0, \dots, Z_7$ separately.
Concatenate all $Z$ matrices.
Multiply by an additional weight matrix $W^O$.
Result: Final $Z$ matrix captures information from all heads.
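A sketch of the mechanism, reusing `self_attention` from the previous snippet (head count and shapes are illustrative, not the paper's configuration):

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """Run each head separately, concatenate the Z_i, then project with W^O."""
    Zs = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(Zs, axis=-1) @ Wo   # (seq, h*d_v) @ (h*d_v, d_model)

rng = np.random.default_rng(1)
n_heads, d_model = 2, 8
d_k = d_model // n_heads                      # per-head dimension
X = rng.normal(size=(4, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model))
Z = multi_head_attention(X, heads, Wo)        # (4, 8), mixes all heads' views
```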
Positional Encoding
Transformer doesn’t use recurrence/convolution $\rightarrow$ No notion of order.
Solution: Add a positional-encoding vector to each input embedding.
Formula:
$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
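The formula translates directly to code; a sketch assuming an even $d_{model}$:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: sin at even indices 2i, cos at odd indices 2i+1."""
    pos = np.arange(seq_len)[:, None]                 # (seq, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                       # PE(pos, 2i+1)
    return pe

PE = positional_encoding(seq_len=4, d_model=8)
# Added element-wise to the input embeddings: X = X + PE
```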
Residual Connection & Layer Normalization:
Each sub-layer (Self-attention, FFNN) has residual connection around it, followed by Layer Norm.
$LayerNorm(x + Sublayer(x))$.
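A sketch of this wrapper (the learned gain/bias parameters of Layer Norm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean / unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection, then layer norm."""
    return layer_norm(x + sublayer(x))

# e.g. wrapping the self-attention sub-layer from the earlier sketch:
# out = add_and_norm(X, lambda x: self_attention(x, Wq, Wk, Wv))
```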
5. Decoder
Structure:
Masked Multi-Head Attention: Self-attention over the decoder's own outputs; future positions are masked.
Encoder-Decoder Attention (Cross-Attention): Query from Decoder, Key/Value from Encoder.
Feed Forward.
Masked Self-Attention:
Prevent positions from attending to subsequent positions.
“Illegal” connections are masked out (scores set to $-\infty$ before the softmax).
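A sketch of the masking trick, reusing `softmax` from the encoder snippet (scores are random placeholders):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(2)
scores = rng.normal(size=(seq_len, seq_len))      # pre-softmax attention scores

# Mask "illegal" connections: position i must not see positions j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf                          # -inf -> softmax weight of exactly 0

weights = softmax(scores)                         # rows still sum to 1;
print(np.round(weights, 2))                       # the upper triangle is all zeros
```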
Final Linear and Softmax Layer
Decoder output (a vector of floats) $\rightarrow$ word probabilities.
Linear: Project to logits (size of vocab).
Softmax: Logits $\rightarrow$ Probabilities.
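A sketch for a single decoder position (random stand-ins for the learned projection and a tiny hypothetical vocabulary):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab_size = 8, 5
decoder_out = rng.normal(size=(d_model,))    # one position's decoder output

W_vocab = rng.normal(size=(d_model, vocab_size))
logits = decoder_out @ W_vocab               # linear projection to vocab-size logits
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax: logits -> probabilities
next_id = int(probs.argmax())                # e.g. a greedy pick
```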
6. Inference (Decoding Strategies)
(1) Greedy Decoding
- Pick the single word with the highest probability at each step.
(2) Beam Search
Keep track of $k$ (beam size) most promising sentences.
Example ($k=2$):
Step 1: Select top 2 words (“I”, “The”).
Step 2: For each beam, expand the next possible words and keep the top 2 paths overall.
Note: $k=1$ is Greedy decoding.
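A toy sketch of the search loop; `fake_log_probs` is a hypothetical stand-in for a trained decoder's next-token distribution. Setting `k=1` recovers greedy decoding:

```python
import numpy as np

def beam_search(log_prob_fn, k=2, max_len=3):
    """Keep the k highest-scoring partial sentences at every step."""
    beams = [((), 0.0)]                             # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(log_prob_fn(seq)):
                candidates.append((seq + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

def fake_log_probs(prefix, vocab_size=5):
    """Deterministic stand-in for a trained decoder's next-token log-probs."""
    r = np.random.default_rng(abs(hash(prefix)) % (2**32))
    logits = r.normal(size=vocab_size)
    return logits - np.log(np.exp(logits).sum())    # log-softmax

for seq, score in beam_search(fake_log_probs, k=2, max_len=3):
    print(seq, round(score, 2))                     # the two best length-3 paths
```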
(3) Stochastic Decoding
Top-k Sampling: Sample from top-k most probable words.
- Avoids repetition/loops.
Top-p Sampling (Nucleus): Sample from smallest set of words whose cumulative probability exceeds $p$.
- Dynamic $k$.
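Sketches of both sampling rules over a hypothetical next-token distribution:

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Sample only among the k most probable tokens."""
    top = np.argsort(probs)[-k:]                    # indices of the k largest probs
    p = probs[top] / probs[top].sum()               # renormalize over that set
    return int(rng.choice(top, p=p))

def top_p_sample(probs, p, rng):
    """Nucleus sampling: smallest set whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                 # descending by probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1   # dynamic "k"
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

rng = np.random.default_rng(5)
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])     # hypothetical next-token distribution
print(top_k_sample(probs, k=3, rng=rng), top_p_sample(probs, p=0.9, rng=rng))
```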
Training vs Inference
Training: Parallelizable (Teacher Forcing).
- Uses the ground-truth token as the decoder input at each step.
Inference: Sequential (Auto-regressive).
- Output of step $t$ becomes input of step $t+1$.
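A comment-level sketch of the contrast; `next_token_fn` is a hypothetical stand-in for the trained decoder plus the linear/softmax head:

```python
# Training (teacher forcing): one parallel pass; decoder inputs are ground truth.
#   target sequence:  [y1, y2, y3, y4]
#   decoder input:    [<bos>, y1, y2, y3]   (target shifted right)

# Inference (auto-regressive): step t's output is step t+1's input.
def generate(next_token_fn, bos_id, eos_id, max_len=20):
    """next_token_fn(seq) -> next token id (stand-in for a trained decoder)."""
    seq = [bos_id]
    for _ in range(max_len):
        tok = next_token_fn(seq)
        seq.append(tok)
        if tok == eos_id:       # stop once end-of-sequence is produced
            break
    return seq
```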