Fine-Tuning: From Base Model to Assistant
Article 8 of 8 · Series: How LLMs Work
In Article 7 we finished building the transformer. Stacks of blocks, multi-head attention, feed-forward, residuals, layer norm — and at the end a base model that can predict the next token. Impressive, but: if you tell this model “What’s the weather tomorrow?”, it won’t answer. It will complete. It might give you back something like “…and the day after? — a typical question people ask themselves when they…” The most likely continuation of the input text, nothing more.
ChatGPT, Claude, Gemini are not base models. They are base models that went through a second, sometimes a third training phase, and during it they learned to behave like assistants. That phase is called fine-tuning, and it’s what this final article of the series is about. We look at how a statistical text-completer becomes a model that follows instructions, writes code, answers questions, and refuses to provide bomb-building instructions — and at the end comes the candid question of what these methods actually solve and what they don’t.
What a base model does and why it isn’t an assistant
A base model has been trained on hundreds of billions of tokens scraped from the internet, with a single task: predict the next token. That task is fantastically rich — anyone who can predict the next token well must have learned grammar, semantics, world knowledge, and patterns of argumentation. But there is one thing the model didn’t learn during that: that humans want to talk to it in a question-answer format.
That becomes visible on the first direct attempt. We prompt a Llama 3 base model with:
What is the capital of France?
The most likely continuation in the training corpus isn’t “Paris” but rather something like:
What is the capital of France? What is the largest city in
France? What language is spoken in France? Here are the answers
to the most common questions about France:
The base model knows the answer. It just doesn’t retrieve it, because that’s not the most likely behavior given its training data. Quiz questions with answers exist on the web, but plenty of quiz questions exist without answers, followed by more quiz questions. The model has no preference for being helpful.
This is exactly where fine-tuning comes in. We take the base model with all its knowledge and adjust it so it responds in specific patterns — question-answer, instruction-execution, conversation-continuation. We’re not teaching it new knowledge but new behavior.
Supervised Fine-Tuning, the first step
The simplest form of fine-tuning is called Supervised Fine-Tuning (SFT). We take the base model and continue training it, but now no longer on random web text but on curated question-answer pairs:
USER: What is the capital of France?
ASSISTANT: Paris.
The training procedure is mechanically the same as during pretraining — we compute the cross-entropy between predicted and actual tokens and propagate the gradients back. What’s different is the data selection plus one technical detail: the loss is computed only over the response tokens, not the question. The model should learn to produce answers, not to repeat questions.
In schematic code:
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_tokens, response_tokens):
    # Run prompt + response through the model in one forward pass
    full_input = torch.cat([prompt_tokens, response_tokens])
    logits = model(full_input)
    # Logits at position i predict token i+1, so shift by one and keep
    # only the positions that predict response tokens
    response_logits = logits[len(prompt_tokens) - 1 : -1]
    # Cross-entropy only over the answer; the prompt is masked out
    return F.cross_entropy(response_logits, response_tokens)
The data for this comes from several sources. Manually curated datasets like Dolly or OpenAssistant, with answers written by humans to real questions. Synthetic datasets, generated by larger models — that’s the standard route today, because a human annotation team for a hundred thousand examples gets expensive. Domain-specific datasets when the model is to be specialized in medicine, law, or code.
What’s surprising about SFT is how little data is enough. Even with a few thousand to a few tens of thousands of high-quality examples, the model’s behavior changes dramatically. The statistical text-completer becomes something that recognizes instructions and follows them. That supports the hypothesis that the knowledge is already there and that SFT mostly shifts access to it rather than installing new information.
Chat templates and the role system
Before we move on, a practical detail that’s often underestimated: the chat template. A model trained on question-answer pairs has to know where a question ends and where an answer begins. That happens through special tokens that mark the roles.
Llama 3, for example, uses a format like:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Paris.<|eot_id|>
Three roles — system, user, assistant — separated by special tokens. The system prompt comes at the beginning and describes the assistant’s character. The alternating user/assistant blocks make up the conversation.
These tokens are not innate to the model. They get associated with their role during SFT because they appear consistently in all training examples. The model learns: after <|start_header_id|>assistant<|end_header_id|> comes a helpful response, not another question.
Different models use different templates. ChatML (introduced by OpenAI) uses <|im_start|> and <|im_end|>. Mistral uses [INST] and [/INST]. Anyone mixing models — say, when self-hosting locally with Ollama or llama.cpp — has to be careful to use the correct template for the respective model. Wrong template, wrong behavior.
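In practice you rarely assemble these tokens by hand. Here is a minimal sketch using HuggingFace’s transformers library, which stores the template alongside the tokenizer; the model ID is only illustrative (the checkpoint is gated), and any instruct-tuned model with a chat template works the same way:

from transformers import AutoTokenizer

# Illustrative, gated checkpoint; any instruct model with a chat template works
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
# Renders the model's own template, including the special role tokens,
# and appends the assistant header so the model knows it should answer
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True)
print(prompt)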
Why SFT alone isn’t enough
With SFT alone, you could actually stop. The model now follows instructions, answers questions, writes code on demand. Wouldn’t that be enough?
It’s not. Three problems remain.
First: SFT teaches the model to look plausible, not to be good. When a training example says “for question X, answer Y,” the model learns to produce Y. But out of thousands of possible answers to a question, Y is only one. The model learns one Y, not an optimal Y. Subtle quality differences between good and very good answers get lost.
Second: SFT doesn’t model what’s undesired. If I teach the model to explain bomb-building instructions, it learns that. If I don’t teach it, it doesn’t learn it — but it still knows the information from pretraining and may produce it under the right prompt conditions. SFT shows the model what to do, but it doesn’t show it what not to do.
Third: SFT trains on one answer per question, but humans often compare answers against each other. “Which of these two answers is better?” is a much richer signal than “this answer is the right one.” SFT throws away that comparative character.
This is where the second phase comes in: preference learning.
RLHF, the classic path
RLHF stands for Reinforcement Learning from Human Feedback. The method was popularized by OpenAI in 2022 with InstructGPT (Ouyang et al., 2022) and formed the basis of ChatGPT. It has three phases.
Phase 1: SFT — as described above, the model is trained on question-answer pairs.
Phase 2: Reward Model. We take the SFT model and have it generate multiple answers to many different prompts — typically two to four per prompt. Human annotators then rank these answers against each other: which is better, which is worse. From this preference data we train a second model, the reward model. It takes a (prompt, answer) pair and returns a number: how good is this answer?
The reward model isn’t a separate architecture but typically the same transformer architecture as the language model, just with a different output head — instead of a probability distribution over tokens, there’s a single scalar output.
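To make that concrete, here is a sketch of what such a reward head and its pairwise training loss might look like in PyTorch. The class, the backbone interface, and reward_loss are illustrative names, not any specific library’s API; the backbone is assumed to return hidden states of shape (batch, seq_len, hidden_size):

import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # same transformer as the LM
        self.value_head = nn.Linear(hidden_size, 1)   # scalar instead of vocab logits

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)             # (batch, seq_len, hidden_size)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)  # score read at last token

def reward_loss(score_chosen, score_rejected):
    # Bradley-Terry pairwise loss: the answer ranked higher by the annotator
    # should receive the higher scalar score
    return -F.logsigmoid(score_chosen - score_rejected).mean()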
Phase 3: RL update. Now comes the actual reinforcement learning. We let the SFT model generate answers to prompts, have the reward model score each answer, and adjust the language model’s weights so that the reward values rise. That happens with the PPO algorithm (Proximal Policy Optimization), a standard method from RL.
Critical here: a KL divergence term that keeps the model from drifting too far from the SFT model. Without that term, the model would drift into strange strategies that produce high reward values but are content-wise nonsense — a classic case of reward hacking.
Schematically:
def rlhf_objective(model, ref_model, reward_model, prompts, beta=0.1):
    # Sample answers from the current policy
    responses = model.generate(prompts)
    # Score each (prompt, answer) pair with the reward model
    rewards = reward_model(prompts, responses)
    # KL term: penalize drifting away from the frozen SFT reference model
    kl = kl_divergence(model.logprobs(prompts, responses),
                       ref_model.logprobs(prompts, responses))
    # Maximize reward while staying close to the reference
    return rewards.mean() - beta * kl.mean()
With this pipeline, GPT-3 became InstructGPT, and the same recipe applied to GPT-3.5 produced ChatGPT. With variants of it, the Llama 2 base model became Llama-2-Chat, Anthropic’s base models became Claude, and every current frontier model became the assistant we know.
The price of RLHF
RLHF works. It is also expensive and complicated.
Data annotation is the largest cost. For a frontier model you need 50,000 to 100,000 preference pairs, each with two or more answers that a human actually read and ranked. At 5–10 minutes per pair that’s thousands to tens of thousands of human hours. OpenAI, Anthropic, and Meta have dedicated annotation teams for this, often globally distributed.
The training infrastructure is non-trivial. PPO with three models simultaneously in memory (policy, reference, reward), all of it for 70-billion-parameter models, is an exercise in GPU engineering. Research groups with smaller budgets cannot reproduce RLHF.
The stability is notorious. PPO is hyperparameter-sensitive, the KL term has to be tuned, annotation quality varies. Training runs fail without clear reasons. John Schulman, the co-inventor of PPO, has said publicly that RLHF is more art than science.
From this came the question: can it be done more simply?
DPO, the pragmatic step
The answer arrived in mid-2023 from a Stanford group around Rafael Rafailov. The paper was titled “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (Rafailov et al., 2023) and it turned the field upside down.
The core idea of DPO is mathematically elegant. The researchers showed that the optimization problem of RLHF — find a policy that maximizes the reward model under a KL divergence constraint — can be reformulated. Instead of first training a reward model and then optimizing with PPO, you can train directly on the preference data with a loss that’s mathematically equivalent:
L_DPO = -log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x))
- β · log(π_θ(y_l|x) / π_ref(y_l|x)))
Where y_w is the chosen (winning) answer, y_l the rejected (losing), π_θ the current model, and π_ref the SFT reference model. β is a single hyperparameter that controls the strength of the KL constraint.
What this means in practice: no more reward model. No three models simultaneously in memory. No PPO with its stability problems. Just a training loop that looks like SFT, only with a slightly different loss.
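A sketch of that loss, assuming the per-sequence log-probabilities (summed over the response tokens) of the chosen and rejected answers have already been computed under the current model and the frozen SFT reference; the function and argument names are illustrative:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # log(pi_theta / pi_ref) for the winning and the losing answer
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Widen the margin between winner and loser, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()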
DPO is today’s pragmatic standard. Llama 3 uses a DPO variant. Most open-source fine-tunes on HuggingFace use DPO. The implementation in TRL (HuggingFace’s RL library) is a few dozen lines of Python.
One property of DPO that gets praised a lot in practice: reproducibility. While PPO training runs can end differently depending on random seed, DPO runs are remarkably stable. That’s not a theoretical detail — it means a small research group can reproduce a DPO run and build on it, which often wasn’t possible with PPO.
Newer variants: KTO, ORPO, RLAIF
DPO is not the end of the line. Over the past two years several refinements have appeared, each addressing a specific problem.
KTO (Kahneman-Tversky Optimization, 2024) addresses a practical weakness of DPO: DPO needs preference pairs — two answers to the same question, a comparison. But in practice you often only have individual annotations: “this answer is good” or “this answer is bad.” KTO uses these binary labels directly, without pairing. Named after Kahneman and Tversky because the loss function is inspired by their prospect theory — losses get weighted more heavily than gains, which empirically calibrates better.
ORPO (Odds Ratio Preference Optimization, 2024) combines SFT and preference learning into a single training run. Classically SFT runs first, then DPO or PPO. ORPO does both at once with a combined loss. Saves training time, simplifies the pipeline further.
SimPO (2024) is DPO without a reference model. Instead of measuring the ratio against a fixed reference model, it uses the length-normalized average log-probability of the response as the implicit reward. Performs comparably well to DPO in many settings and needs one component fewer.
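Sketched with the same conventions as the DPO loss above, with gamma as SimPO’s target reward margin and the response lengths measured in tokens; the argument names and the default values for beta and gamma are illustrative:

import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    # Length-normalized log-probabilities replace the reference-model ratios
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # The winner should beat the loser by at least the margin gamma
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()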
RLAIF (Reinforcement Learning from AI Feedback) replaces the human annotators with a larger language model. Instead of asking humans which answer is better, we ask GPT-4 or Claude. That scales much better — a million comparisons take days instead of months — but it inherits the biases of the rating model. Anthropic’s Constitutional AI is the most famous variant of this.
Constitutional AI, a different approach
While the DPO family simplifies the mechanics of RLHF, Constitutional AI (CAI) goes in a different direction: same RL mechanics, but with a fundamentally different source of preference.
Anthropic described CAI in a 2022 paper (Bai et al., 2022) and has refined it in every Claude generation since. The basic idea is a constitution — a list of principles like “respond helpfully,” “refuse harmful content,” “be honest about uncertainty,” “respect user autonomy.” These principles are formulated in natural language, not as a rule system.
During training a self-critique loop runs. The model generates an answer and is then asked to critique its own output: “Does this answer violate principle X from the constitution? If so, rewrite it.” From these self-critique pairs the preference data emerges — the criticized answer as “losing,” the revised one as “winning.” With this data RL is then run, classically via PPO or with DPO variants.
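A schematic sketch of that loop, with model.generate standing in for any text-in, text-out call and all prompt wording purely illustrative:

def generate_cai_preferences(model, prompts, constitution):
    pairs = []
    for prompt in prompts:
        answer = model.generate(prompt)
        for principle in constitution:
            critique = model.generate(
                f"Does this answer violate the principle '{principle}'? "
                f"If so, explain how.\n\nAnswer: {answer}")
            revised = model.generate(
                f"Rewrite the answer so it follows the principle "
                f"'{principle}'.\n\nAnswer: {answer}\nCritique: {critique}")
            # The criticized answer becomes "losing", the revision "winning"
            pairs.append({"prompt": prompt, "chosen": revised, "rejected": answer})
            answer = revised
    return pairs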
The charm of CAI is its scalability. A constitution with 30 to 50 principles is humanly writable. The annotation is taken over by the model itself. That saves the most expensive human annotation hours.
The price: the model is trained to hold itself to its own standards. If the model has systematic blind spots — and every model does — they don’t get caught in the CAI loop, they get reinforced.
Which method current frontier models use is mostly a mix. Claude uses a combination of CAI for safety and classic RLHF for helpfulness. Llama 3 mixes SFT with DPO and rejection sampling. GPT-4 and its successors aren’t publicly documented, but public clues suggest a combination of RLHF and RLAIF with a partly self-generated data stack.
The complete pipeline
Let’s put it all together. What happens, from a raw internet corpus to ChatGPT, Claude, or Llama 3?
| Phase | What happens | Data volume | Effort |
|---|---|---|---|
| Pretraining | Next-token prediction on raw web text | 1–15 trillion tokens | Weeks to months on thousands of GPUs |
| SFT | Question-answer pairs, chat format | 10,000 – 1 million examples | Hours to days |
| Preference Learning | DPO/RLHF on ranked answer pairs | 50,000 – 1 million pairs | Days |
| Iteration | Multiple SFT/preference rounds, sometimes with synthetic data | varies | Weeks to months |
| Red Teaming | Adversarial tests, safety prompts | Thousands of curated edge cases | parallel to iteration |
| Deployment | Quantization, serving, A/B tests | — | continuous |
The first steps are the most expensive — pretraining accounts for 70 to 95 percent of the compute cost depending on the model. But the behavior we perceive as users gets shaped in the later phases. A Llama 3 base model and Llama 3 Instruct have nearly identical weights, yet they behave completely differently.
That’s also the reason why “open weights” in the industry today often means: the base model is open, the fine-tuning recipe usually isn’t. Meta releases Llama weights after both phases but doesn’t document the exact SFT and DPO setup. Mistral, Qwen, DeepSeek, and others follow the same pattern. Anyone wanting to reproduce an open-weights model from scratch can follow the pretraining recipe fairly closely, but can only approximate the post-training.
What alignment doesn’t solve
At this point the article could wrap up neatly. Pretraining, SFT, preference learning, done — we made an assistant out of the base model. In reality the question of what we did is harder than it looks.
The methods in this article are often summarized under the term alignment — bringing the model into line with human values. But what they technically do is more narrowly framed: they shift the probability distribution over answers. What was rated good in training becomes more probable. What was rated bad becomes less probable. That’s behavior shaping, not values alignment in the philosophical sense.
Three problems illustrate the difference.
Sycophancy is probably the best-documented failure mode. Models trained with RLHF often learn to agree with their conversation partners — even when the partner is factually wrong. That happens because human annotators react positively to answers that match their position. The model learns: agree, then you get good ratings. That’s not “aligned with truth,” that’s “aligned with annotator bias.”
Goodhart’s Law kicks in. When a measure becomes a target, it ceases to be a good measure. When we train the model to achieve high reward values, it optimizes the reward — not the underlying property the reward model was supposed to measure. That’s the theoretical reason so much attention is directed at the robustness of reward models, and why the problem isn’t solved by better data alone.
Distributional shift between training and use. The model is trained on a particular distribution of prompts — curated, selected examples, often English, often in an academic or business-oriented register. In use, prompts come in dozens of languages, with typos, with unusual concerns, with adversarial attempts to manipulate the model. How the model behaves under distributional shift is an empirical question that can’t be derived from training.
Anthropic, OpenAI, and the other frontier labs know this. The industry’s response is multi-layered: continuous red-teaming, broader and more diverse annotator pools, constitutional methods to shift the source of bias, interpretability research to understand what the model has actually learned (see the section on mechanistic interpretability in Article 7), and a growing research field labeled “AI safety” that addresses precisely these questions.
What that means practically: the models we work with today are much more helpful and safer than a raw base model, and that’s a huge step forward. But they are not “aligned” in a strong sense. They are behaviorally shaped. The difference becomes important as soon as the models operate in agentic setups — acting independently in browsers, file systems, mail accounts, code repositories. There the weak spots of behavior shaping become painfully visible.
What we take with us
Eight articles, one foundation. From the single token to the production-ready assistant. We’ve seen:
- Tokens and language models (Part 1) — how language becomes mathematics in the first place, and how the trick “predict the next word” carries the entire edifice.
- Embeddings (Part 2) — how words become vectors, and why semantic similarity becomes spatial proximity.
- Neural networks (Part 3) — how linear algebra plus activation functions yields a universal function approximator.
- Backpropagation (Part 4) — how a model learns by propagating errors backward and adjusting weights.
- Context and RNNs (Part 5) — why order matters and why the old answer didn’t scale.
- Attention (Part 6) — how a single mechanism changed everything by directly connecting every token to every other.
- The transformer (Part 7) — how attention, position, depth, and stability became the architecture that has formed the basis of every large language model since 2017.
- Fine-tuning (this article) — how a base model becomes an assistant, and why “alignment” is probably not the most honest label for what happens.
What we haven’t covered in detail: the hardware reality (GPU clusters, distributed training, mixed precision, Flash Attention), the inference optimizations (quantization, speculative decoding, KV cache management), multimodality (vision transformers, audio tokens, video), the agentic extensions (tool use, memory, MCP), and the open research questions (reasoning, long context, continual learning). Each of these is its own series.
What remains is an insight that doesn’t sit in any one article alone. These models are more remarkable than many acknowledge, and they are less magical than many fear. They are the result of a remarkably pragmatic bundle of statistics, linear algebra, lots of compute, lots of data, and a few good ideas. They are not really understood — mechanistic interpretability research is just scratching the surface, and for current frontier models the map is almost empty. But they are more understood than the headlines suggest, and no one who has read the eight articles of this series should still feel that LLMs are a black box.
They aren’t. They are a very large, very well trained, very useful mixing machine. What we do with it is the next question — and it won’t be answered in one article, but in the coming years by all of us.
Translated with the help of Claude.
All articles in the series
- The next word, how language models work
- Words as points in space, what embeddings really are
- Neural networks from scratch
- Backpropagation, how a model learns
- Context and RNNs, why order matters
- Attention Is All You Need
- The transformer, the complete architecture
- Fine-tuning, from base model to assistant ← this article
Series: How LLMs Work · rotecodefraktion.de