17. LLM Development
Pretraining
Text $\rightarrow$ Text model
Typically only ~1~3% of the raw unstructured data is usable after quality filtering
Ways to use the Transformer
↳ Encoder
↳ Decoder
↳ Encoder-Decoder
(1) Encoder-only model (Autoencoding)
[MASK] tokens (masked language modeling)
Goal: Reconstruct text (Bidirectional)
Good at: Sentiment analysis.
Ex: BERT, RoBERTa
(2) Decoder-only (Autoregressive)
Causal masking (each token attends only to previous tokens)
Goal: Predict next token (Unidirectional)
↳ Emergent behavior
Ex: GPT, BLOOM
(3) Encoder-Decoder model (Seq-to-Seq model)
↳ Reconstruct corrupted spans (span corruption)
Ex: T5, BART
LLM Development
(1) Prompt Engineering
① Direct prompting (Zero shot)
- ↳ No examples, just the instruction
② One shot
- ↳ One clear example
③ Few- / Multi-shot prompting
Better at complex tasks
But the LLM may still fail at complex tasks with prompting alone.
Revise the prompt $\rightarrow$ Chain-of-Thought Prompting
Chain-of-Thought Prompting
↳ Encourage the LM to reason step by step
$\Rightarrow$ Show the reasoning process in the prompt. (Does it also work without examples?)
Add instruction, feature, architecture, objective prompt.
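A minimal sketch contrasting the prompting styles above; the prompt texts are illustrative examples and not from the notes:

```python
# Illustrative prompt templates (hypothetical text, for comparison only).

# (1) Direct / zero-shot: no examples, just the instruction.
zero_shot = "Classify the sentiment of this review: 'The movie was great.'\nSentiment:"

# (3) Few-shot: a few worked examples before the real query.
few_shot = (
    "Review: 'Terrible plot.' Sentiment: negative\n"
    "Review: 'Loved every minute.' Sentiment: positive\n"
    "Review: 'The movie was great.' Sentiment:"
)

# Chain-of-thought: the example shows intermediate reasoning before the answer,
# encouraging the model to produce its own reasoning chain.
cot = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?\n"
    "A: He starts with 5. 2 cans x 3 balls = 6. 5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples, used 20, then bought 6 more. How many now?\n"
    "A:"
)
```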
(2) Instruction Tuning
Motive: Base LLMs are not aligned with users’ intent.
- ↳ We need alignment to bridge the gap!
Approaches
① Zero-/Few-shot prompting for in-context learning (Due to emergent ability)
↳ Tuning-free
↳ Limited to what fits in the context length
② SFT (Tuning-based alignment)
- Gradient updates
Instruction Tuning Process
Unsupervised Pretrained Base LLM (GPT) $\xrightarrow{\text{Supervised}}$ Fine-tuned LLM.
Instruction tuning (First fine-tuning)
Training data: instruction prompts paired with expected outputs
$\Rightarrow$ One task alone is not enough; it must generalize across tasks
Process:
Prompt $\rightarrow$ Pre-trained LLM $\rightarrow$ output, compared with the label
Loss: CE (Cross Entropy)
Supervised Learning.
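A minimal sketch of this supervised loss, assuming a PyTorch-style causal LM that returns per-token logits; masking the prompt tokens with `ignore_index=-100` is a common convention and an assumption here, not something stated in the notes:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lengths):
    """Cross-entropy on the completion tokens only (prompt tokens are masked)."""
    # Shift so that the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Ignore the prompt portion of each example; supervise only the response.
    for i, p_len in enumerate(prompt_lengths):
        shift_labels[i, : p_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```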
Catastrophic forgetting
↳ Fine-tuning on a single task degrades performance on other tasks.
Fine-tune on multiple tasks
↳ Multi-task instruction tuning.
Comparison of Adaptation Approaches (fine-tuning vs. prompting vs. instruction tuning)
① Supervised fine-tuning (BERT, T5): Base LM $\rightarrow$ fine-tune on task A $\rightarrow$ inference on task A
② Prompting for in-context learning (GPT-3): Base LM prompted with few-shot examples of task A $\rightarrow$ inference on task A
③ Instruction tuning (FLAN):
Instruction-tune the Base LM on many tasks (B, C, D, ...)
$\rightarrow$ Inference on an unseen task (A)
FLAN & Human Preference
FLAN
↳ A set of instruction datasets used for instruction tuning.
Like the "dessert" after the main course of pretraining (a relatively small additional step).
Limitation
① Expensive to collect ground-truth data
② Some tasks have no single correct answer
③ Human-written demonstrations can be suboptimal
Optimizing Human Preference (Second fine tuning)
- LM objective $\ne$ Human preference
Align with human preference
↳ Helpful, Honest, Harmless
: Instruction tuning + Preference tuning
RLHF (Reinforcement Learning from Human Feedback)
Instruction $\rightarrow$ Human feedback $\rightarrow$ Human-aligned
Reinforcement Learning
- ↳ Train an agent to make a sequence of decisions that maximizes cumulative reward.
Steps in RLHF
① Supervised fine tuning (SFT)
② Human feedback + Reward model
Humans rank candidate completions
↳ The RM is trained to predict a scalar score consistent with the rankings
③ Fine-tuning with RL
- ↳ Using RL algorithm (PPO)
RLHF Details
(1) Supervised fine-tuning (SFT)
① Sample prompts from a prompt dataset
② A human labeler writes the desired response
③ Fine-tune LLM $\rightarrow$ Instruct LLM
(2) Reward Model
① Human labeler ranks outputs
② Train RM
Bradley-Terry (BT) preference model
↳ Pairwise comparisons
$P(i>j) = \frac{\exp(\beta_i)}{\exp(\beta_i) + \exp(\beta_j)} = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}$
(Softmax $\rightarrow$ Sigmoid)
$= \sigma(\beta_i - \beta_j)$
$\therefore \mathcal{L} = -\log \sigma(\beta_i - \beta_j)$
↳ Pairwise comparisons over $n$ completions.
$\binom{n}{2}$ pairs $\rightarrow$ convert each pair to a triple $(x, y_w, y_l)$ (winner $y_w$, loser $y_l$)
$(x, y_w) \rightarrow (RM) \rightarrow r_\phi(x, y_w) = r_w$
$(x, y_l) \rightarrow (RM) \rightarrow r_\phi(x, y_l) = r_l$
Loss: $-\log \sigma(r_w - r_l)$
Gradient update
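A minimal sketch of this pairwise Bradley-Terry loss, assuming the RM has already produced scalar rewards `r_w`, `r_l` for the preferred and rejected completions of the same prompt:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    # L = -log sigma(r_w - r_l); logsigmoid is the numerically stable form.
    return -F.logsigmoid(r_w - r_l).mean()

# Dummy scores for a batch of 3 preference pairs (illustrative values).
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(r_w, r_l))  # smaller when r_w exceeds r_l by a wide margin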
(3) Fine-tuning with RL
① The policy being optimized is the Instruct LLM (from SFT).
② Update the LLM with an RL algorithm (PPO), using the RM's score as the reward.
- GPT (InstructGPT) uses PPO (Proximal Policy Optimization).
Other Optimization Methods
DPO (Direct Preference Optimization) (Not RL)
Tunes LLM without RL
$\mathcal{L}_{DPO} = -\log \sigma(\mathrm{score}(y_w) - \mathrm{score}(y_l))$ : a binary cross-entropy loss, where $\mathrm{score}(y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward
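A minimal sketch of the DPO loss under the implicit-reward interpretation above; the sequence log-probabilities are assumed to be summed over the completion tokens:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: binary cross-entropy on preferences, no reward model or RL loop."""
    score_w = beta * (logp_w - ref_logp_w)   # implicit reward of preferred y_w
    score_l = beta * (logp_l - ref_logp_l)   # implicit reward of rejected y_l
    return -F.logsigmoid(score_w - score_l).mean()
```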
GRPO
↳ Reasoning fine-tuning.
$\Rightarrow$ Can be used for RLVR (RL with verifiable rewards), as with PPO.
(No additional value neural net)
↳ Sample $k$ responses $y_1, \ldots, y_k \sim \pi_\theta(\cdot \mid x)$ and compute rewards $r_i = R(x, y_i)$
- ↳ Compute the relative (group-normalized) advantage: $A_i = \frac{r_i - \mathrm{mean}(r_{1:k})}{\mathrm{std}(r_{1:k})}$
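A minimal sketch of the group-relative advantage; the 0/1 rewards below stand in for a verifiable reward function and are purely illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within the group of k samples; the group mean is the baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# k = 5 sampled completions for one prompt, scored by a verifiable reward (e.g. 0/1).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for above-average completions
```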
Parameter-efficient fine tuning (PEFT)
↳ Fine-tune the LLM while updating only a small fraction of its parameters
Methods:
① Selective: train only a subset of the existing layers/parameters
② Reparameterization (LoRA): represent the weight update with low-rank matrices
③ Additive (Adapters, Prompt tuning): add small new trainable modules or soft prompt tokens, keeping the base model frozen (cf. in-context prompt engineering)
LoRA (Low-Rank Adaptation)
Low-rank matrices
$W' = W + \Delta W = W + BA$ (frozen $W$; trainable low-rank factors $B$, $A$)
Structure: $h = Wx + BAx$, i.e., the frozen path $x \rightarrow W \rightarrow h$ plus the low-rank path $x \rightarrow A \rightarrow (r\text{-dim}) \rightarrow B$, added to the output
Example:
$W \in \mathbb{R}^{d \times k}$ where $d=512, k=64$
LoRA with rank $r=8$
$B = 512 \times 8$
$A = 8 \times 64$
Reduction:
$\frac{512 \times 8 + 8 \times 64}{512 \times 64} = \frac{4608}{32768} \approx 14\%$ of the original parameters
: roughly an 86% reduction in trainable parameters
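A minimal sketch of a LoRA linear layer matching the example dimensions above ($d=512$, $k=64$, $r=8$); the zero/small-random init and the $\alpha/r$ scaling follow the usual LoRA recipe, but the exact values here are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""
    def __init__(self, k: int, d: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(k, d, bias=False)
        self.base.weight.requires_grad_(False)           # freeze W (d x k)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, zero-init so dW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(k=64, d=512, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 512*8 + 8*64 = 4608 trainable params vs 512*64 = 32768 in W
```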
Prompt tuning
- ↳ Learn additional (soft) tokens prepended to the input, rather than fine-tuning the whole model for every task
RAG (Retrieval-Augmented Generation)
Motivation: knowledge cut-offs (the model cannot know facts beyond its training data)
Hybrid approach in LLM
Flow:
Prompt $\rightarrow$ Query Encoder $\rightarrow$ Vector Database $\rightarrow$ retrieved info + original prompt $\rightarrow$ LLM
Non-parametric knowledge (retriever) + parametric knowledge (LLM)
Process:
(1) Indexing
① Split documents into chunks (chunk size $C$).
② Embed each chunk and store it as a vector embedding of dimension $V$.
(2) Retrieval
For every query (against the stored chunk embeddings, $C \times V$):
① Embed query: $Q \rightarrow$ Vector
② Retrieve Context
Use ANN (approximate nearest neighbor) search to find the closest vectors
Fetch the top-$K$ chunks.
$\{\text{top-}K \text{ chunks}, Q\} \rightarrow \text{LLM} \rightarrow \text{Final Answer}$
[Original Chunks $\times K$] + [Original Query]
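A minimal end-to-end sketch of the indexing and retrieval steps; `embed()` is a hypothetical stand-in for a real embedding model, and a production system would use an ANN index (e.g. FAISS) rather than brute-force similarity:

```python
import numpy as np

def embed(text: str, V: int = 8) -> np.ndarray:
    """Stand-in embedding: deterministic per text within a run, unit-normalized."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(V)
    return v / np.linalg.norm(v)

# (1) Indexing: chunk documents, embed and store each chunk (C chunks x V dims).
chunks = ["LoRA freezes W and trains low-rank A, B.",
          "RLHF uses a reward model trained on human rankings.",
          "RAG retrieves external knowledge at query time."]
index = np.stack([embed(c) for c in chunks])

# (2) Retrieval: embed the query and take the top-K closest chunks.
query = "How does RAG add external knowledge?"
q = embed(query)
K = 2
top_k = np.argsort(index @ q)[::-1][:K]
context = [chunks[i] for i in top_k]

# (3) Augment the original prompt with the retrieved chunks and send it to the LLM.
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```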
RAG vs Fine-tuning
| Feature | RAG | Fine-tuning |
|---|---|---|
| Knowledge | External knowledge | New skill |
| Cost | Easy, Cheap | Expensive |
| Transparency | Highly verifiable | Opaque |