Backpropagation — How a Model Learns
Article 4 of 8 · Series: How LLMs Work
At the end of Article 3 we had a complete neural network — embedding in, logits out, every step built from scratch. But the network couldn’t actually do anything. The output probabilities were nearly uniform because the weights were initialized randomly. The network had never seen anything, never learned anything.
In this article, we close that gap. We look at how a model learns — not metaphorically, but mechanically. Backpropagation is the algorithm that turns a randomly initialized network into a trained one. The idea is surprisingly simple. The implementation requires a bit of calculus, but the intuition is accessible without a math degree.
What Does “Learning” Actually Mean?
A model has weights. These weights determine what it outputs. Learning means: change the weights so that the outputs get better.
“Better” has to be defined precisely. If we show a model that “The capital of France is” should be followed by the token " Paris", and the model says " Berlin", then the output is wrong. How wrong exactly? That’s what the loss measures — a number that tells us how far the prediction is from the truth. The smaller the loss, the better the model.
Learning is therefore an optimization problem: find the weights that minimize the loss. For a mini-MLP with 19,210 parameters, that’s already a 19,210-dimensional problem. For Llama 3.1 with 405 billion parameters, it’s a 405-billion-dimensional one. Brute force — trying all combinations — is out of the question immediately. We need a smart approach.
The smart approach is called gradient descent.
The Loss: A Measure of Wrongness
Before we can optimize, we need the objective function. For language models, this is typically the cross-entropy loss — it measures how far one probability distribution is from another.
Quick Detour: One-Hot Encoding
Before we get to the loss, we need a way to represent the “correct answer” in a form the model can work with. The model doesn’t output words, it outputs a probability distribution over all tokens in the vocabulary. To compare its output against the correct answer, the correct answer must also be a probability distribution.
One-hot encoding is the simplest option. For a vocabulary of 4 tokens, we write each token as a vector with four entries: a 1 at the position of the token, 0 everywhere else.
Vocabulary: [" Paris", " Berlin", " London", " Madrid"]

```
" Paris"  → [1, 0, 0, 0]   (index 0)
" Berlin" → [0, 1, 0, 0]   (index 1)
" London" → [0, 0, 1, 0]   (index 2)
" Madrid" → [0, 0, 0, 1]   (index 3)
```
The vector is called “one-hot” because exactly one position is “hot” (has value 1), the rest is “cold” (0). Read as a probability distribution, [1, 0, 0, 0] means: “100% Paris, 0% everything else” — a distribution that is absolutely certain Paris is the correct answer.
That’s exactly what we need for the comparison: the model outputs a distribution (e.g. [0.1, 0.7, 0.15, 0.05]), we have the true answer as a distribution ([1, 0, 0, 0]), and the loss measures the difference between them.
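The one-hot construction can be sketched in a few lines of NumPy. This is our own illustration of the encoding above (the same `np.eye` trick appears again in the training code later in this article):

```python
import numpy as np

vocab = [" Paris", " Berlin", " London", " Madrid"]

# np.eye(4) is the 4×4 identity matrix - its i-th row is exactly
# the one-hot vector for token index i.
one_hot = np.eye(len(vocab))

print(one_hot[0])  # " Paris"  -> [1. 0. 0. 0.]
print(one_hot[1])  # " Berlin" -> [0. 1. 0. 0.]
```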
Cross-Entropy by Example
Let’s say the model is supposed to predict the next token. The correct answer is " Paris", so as one-hot:
y_true = [1, 0, 0, 0] # 100% Paris, 0% everything else
But the model says:
y_model = [0.1, 0.7, 0.15, 0.05] # 70% Berlin, only 10% Paris
The cross-entropy loss for this prediction is -log(0.1) ≈ 2.30. If the model had predicted " Paris" at 90%, the loss would be -log(0.9) ≈ 0.11. At 100% it would be -log(1.0) = 0. The more confident the model is about the correct answer, the smaller the loss.
```python
import numpy as np

def cross_entropy(y_pred, y_true_index):
    """Cross-entropy for a single prediction.

    y_pred: probabilities from the model, e.g. [0.1, 0.7, 0.15, 0.05]
    y_true_index: index of the correct token, e.g. 0 for " Paris"
    """
    return -np.log(y_pred[y_true_index] + 1e-12)  # epsilon guards against log(0)

# Example: model predicts strongly wrong token
probs = np.array([0.1, 0.7, 0.15, 0.05])
print(f"Loss for wrong prediction: {cross_entropy(probs, 0):.4f}")

# Example: model is uncertain
probs = np.array([0.3, 0.3, 0.2, 0.2])
print(f"Loss for uncertain model: {cross_entropy(probs, 0):.4f}")

# Example: model predicts correctly with high confidence
probs = np.array([0.9, 0.05, 0.03, 0.02])
print(f"Loss for correct prediction: {cross_entropy(probs, 0):.4f}")
```

```
Loss for wrong prediction: 2.3026
Loss for uncertain model: 1.2040
Loss for correct prediction: 0.1054
```
The loss is a single number. That’s the crucial trick. No matter how complex the model, no matter how many weights it has — at the end of the forward pass a single number comes out that says “this is how wrong you were just now”. This one number is what we want to minimize.
Why -log and not just 1 minus probability?
Cross-entropy isn’t chosen arbitrarily. It comes from information theory: -log(p) measures the “surprise” when an event with probability p occurs. The less likely the model considers the correct answer, the more surprising it is that the answer was right, the bigger the loss.
Formally, the cross-entropy loss between true distribution y and predicted distribution ŷ is:
L = -Σ yᵢ · log(ŷᵢ)
With one-hot encoding, only one yᵢ equals 1, all others are 0. The sum reduces to the single term where yᵢ=1, i.e. -log(ŷ_correct).
Why is this better than 1 - ŷ_correct? Two reasons:
1. **Gradient behavior:** The derivative of `-log(p)` is `-1/p`. When the model is completely wrong (p small), the gradient is huge and pulls strongly in the right direction. `1 - p` would have the same gradient everywhere, regardless of how wrong the model is.
2. **Combines elegantly with softmax:** The derivative of softmax-followed-by-cross-entropy is simply `ŷ - y` — one of the simplest derivatives there is. This is no coincidence but the reason this combination is used in practically all classification networks.
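That `ŷ - y` really is the gradient of softmax-plus-cross-entropy can be checked numerically. A minimal sketch (the logits, the target index, and the helper names are our own toy choices): nudge each logit slightly, measure the loss change, and compare against probs minus one-hot:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.5, 1.2, -0.3, 0.1])
target = 0  # index of the correct class

def loss(z):
    return -np.log(softmax(z)[target])

# Analytic gradient: probs - one_hot
probs = softmax(logits)
analytic = probs.copy()
analytic[target] -= 1.0

# Numerical gradient via central difference quotients
h = 1e-6
numerical = np.zeros_like(logits)
for i in range(len(logits)):
    zp, zm = logits.copy(), logits.copy()
    zp[i] += h
    zm[i] -= h
    numerical[i] = (loss(zp) - loss(zm)) / (2 * h)

print(np.max(np.abs(analytic - numerical)))  # tiny: the two gradients agree
```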
The Intuition Behind Gradients
We have a loss function. We have many weights. We want to change the weights so that the loss decreases. How?
Let’s forget neural networks for a moment and imagine a single weight — call it w. The loss depends on w: L(w). When we change w, L changes. If we plot this as a curve, with w on the x-axis and L on the y-axis, there’s a minimum somewhere. That’s the weight we want to find.
But we can’t see the whole curve. We’re standing at one point and only know: the loss at this location is such-and-such. How do we get to the minimum?
The answer is the derivative. The derivative of L with respect to w tells us: if I increase w a tiny bit, how does L change? If the derivative is positive, L gets bigger — we need to decrease w to reduce the loss. If negative, we need to increase w. If zero, we’re at the minimum.
For many weights we speak of the gradient — a vector that tells us, for each weight individually, in which direction it needs to change. The gradient points in the direction of steepest ascent of the loss. We move in the opposite direction, because we want to decrease the loss. Hence “gradient descent” — descent along the gradient.
A crucial parameter is the learning rate. It determines how big the step is that we take in the descent direction. Too small: we never get there, training takes forever. Too large: we overshoot the minimum and start oscillating or diverging. Balancing the learning rate is one of the central questions in training neural networks.
Gradient Descent: The Algorithm
In code, gradient descent is two lines, repeated in a loop:

```python
gradient = compute_gradient(w, data)  # How steep, and in which direction?
w = w - learning_rate * gradient      # One step downhill
```
Repeat this often enough and you land in a minimum. That, at its core, is all training is.
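Here is the loop as a complete, runnable toy example. The loss and starting point are our own illustration: L(w) = (w − 3)², whose gradient 2(w − 3) we can write down by hand, with the minimum at w = 3:

```python
def compute_gradient(w):
    # Loss: L(w) = (w - 3)**2  ->  dL/dw = 2 * (w - 3)
    return 2 * (w - 3)

w = 0.0  # arbitrary starting point
learning_rate = 0.1

for step in range(50):
    gradient = compute_gradient(w)    # how steep, and in which direction?
    w = w - learning_rate * gradient  # one step downhill

print(w)  # very close to 3.0, the minimum of the loss
```

Each step shrinks the distance to the minimum by a constant factor (here 0.8), so after 50 steps the remaining error is negligible.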
The only problem: how do you compute compute_gradient(w, data) for a network with many layers and billions of weights? You can’t nudge each weight individually and re-measure the loss — that would mean one full forward pass per parameter, which is absurdly expensive. You need a way to compute the gradient analytically.
The answer is the chain rule.
The Chain Rule: Gradients Through Many Layers
A neural network is a chain of operations: Embedding → Layer 1 → Activation → Layer 2 → Softmax → Loss. Each of these operations is a function. The whole network is therefore a nested function, something like L = f(g(h(embedding))).
The chain rule from calculus says: if I want to know how strongly the loss depends on a weight buried deep in the network, I multiply the local gradients along the path.
“Local” means: for each individual operation, you can easily compute how its output changes when its input changes. The chain rule says: just multiply these local slopes together, and you get the overall slope through the entire chain.
Wait — what exactly is the total gradient?
Take a concrete chain: f(x) = a·x with a = 2.0, followed by g(f) = f². At x = 1.3, the output is g(f(1.3)) = 6.76, but the total gradient is 10.4. These are two different numbers that don’t obviously have anything to do with each other. What does that 10.4 mean?
The total gradient is not the output. It answers a completely different question: “If I change x by a tiny amount, how strongly does the output change?” The answer at x = 1.3 is: the output changes 10.4 times as much as the change in x.
Let’s check concretely:
| x | f(x) = 2·x | g(f(x)) = f(x)² |
|---|---|---|
| 1.30 | 2.60 | 6.760 |
| 1.31 | 2.62 | 6.864 |
Change in x: +0.01. Change in output: +0.104. Ratio: 0.104 / 0.01 = 10.4 — exactly the total gradient.
So the gradient tells us not the output, but the sensitivity of the output to x. That’s exactly what we need for training: “in which direction and how strongly do I need to push a weight so the loss decreases?” The answer is in the gradient.
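The same check works in code. For the chain f(x) = 2x, g(f) = f² from above, the chain rule gives dg/df · df/dx = 2·f(x) · 2, and a central difference quotient should agree with it (the step size h is our own choice):

```python
def f(x):
    return 2 * x

def g(f_val):
    return f_val ** 2

def chain(x):
    return g(f(x))

x = 1.3

# Chain rule: dg/df = 2 * f(x), df/dx = 2 -> total gradient = 2 * f(x) * 2
analytic = 2 * f(x) * 2
print(analytic)  # 10.4

# Numerical check with a central difference quotient
h = 1e-6
numerical = (chain(x + h) - chain(x - h)) / (2 * h)
print(abs(numerical - analytic) < 1e-4)  # True: both agree
```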
Careful: the gradient is only locally valid
One stumbling block: you might think you could compute the gradient once and then predict any arbitrary output — just Output + gradient · Δx. That only works for tiny changes though.
For our example with x = 1.3, output 6.76, gradient 10.4:
- Tiny step: `x = 1.31` → predicted `6.76 + 0.01·10.4 = 6.864` → actual `6.8644` ✓ fits almost exactly.
- Medium step: `x = 1.5` → predicted `6.76 + 0.2·10.4 = 8.84` → actual `9.00` — already a bit off.
- Big step: `x = 3.0` → predicted `6.76 + 1.7·10.4 = 24.44` → actual `36.00` — clearly off.
Why? The function is curved. At every point, the slope is different. The gradient at x = 1.3 is a snapshot right there — it knows nothing about the curvature elsewhere. At x = 3.0 the actual gradient would be 2·2·(2·3) = 24, not 10.4 anymore.
The analogy: we’re standing on a hill and measuring the slope under our feet. That tells us in which direction it currently goes up or down — but not where the valley is. A small step in the descent direction brings us closer. A large leap, however, lands us in a different landscape where the slope is completely different.
This is exactly why gradient descent is iterative: local gradient, small step, measure new gradient, next step. Thousands of times. That’s also why the learning rate is so tricky — if the step is too big, we leave the region where the gradient was still a good approximation, overshoot the minimum, and land on the opposite slope.
Chain rule as gears
Each gear has its own local ratio. The total ratio through the chain is simply the product. That's the chain rule.
An intuitive comparison: imagine a chain of gears. The first gear turns once, the second turns 3× as fast, the third 2× as fast as the second. How fast does the third turn when the first turns once? 1 × 3 × 2 = 6. That’s the chain rule.
In a neural network exactly that happens — only with many gears and in many dimensions simultaneously.
The chain rule formally
The notation from calculus looks more complicated than it is. dL/dx means: “how strongly does L change when I change x by a tiny bit?” It’s simply a number — the slope of the function L at the point x.
For a nested function L = f(g(h(x))) it holds that:
dL/dx = (dL/df) · (df/dg) · (dg/dh) · (dh/dx)
In words: The total slope of L with respect to x is the product of four local slopes:
- `dh/dx`: how strongly does `h` change when `x` changes?
- `dg/dh`: how strongly does `g` change when its input `h` changes?
- `df/dg`: how strongly does `f` change when its input `g` changes?
- `dL/df`: how strongly does `L` change when `f` changes?
Exactly the gear picture: each gear has a local ratio, and the total ratio is the product. Each factor is a local gradient — a quantity that’s easy to compute individually. To get dL/dx, you multiply these local gradients along the path from L back to x.
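A concrete three-link sketch makes the multiplication tangible. The functions here are our own toy picks — the point is only that the product of local gradients matches the overall slope:

```python
import numpy as np

# Three-link chain L = f(g(h(x))):
#   h(x) = x**2      -> dh/dx = 2*x
#   g(h) = 3*h + 1   -> dg/dh = 3
#   f(g) = sin(g)    -> df/dg = cos(g)
x = 0.7
h_val = x ** 2
g_val = 3 * h_val + 1
L = np.sin(g_val)

# Multiply the local gradients along the path from L back to x:
dL_dx = np.cos(g_val) * 3 * (2 * x)

# Numerical check with a central difference quotient
eps = 1e-6
L_plus = np.sin(3 * (x + eps) ** 2 + 1)
L_minus = np.sin(3 * (x - eps) ** 2 + 1)
print(dL_dx, (L_plus - L_minus) / (2 * eps))  # the two numbers agree
```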
In a neural network this corresponds to a backward traversal: from the loss back through each layer, with each layer contributing its own local gradient. That’s exactly why the algorithm is called backpropagation — it propagates the gradient backward through the network.
For multidimensional functions (which networks almost always are), the derivatives become Jacobian matrices, but the principle remains: multiply local gradients. In implementation, these are matrix multiplications instead of number multiplications.
Backpropagation, Implemented by Hand
Enough theory. We implement backpropagation for a mini network that learns XOR — the problem from Article 3. Four data points, two inputs, one output. It doesn’t get smaller than this, and it demonstrates everything that also happens in Llama 3.1.
A quick word on the activation function
In Article 3 we had ReLU and GELU. For this example we take a third: Sigmoid. Sigmoid squashes any input value into the range between 0 and 1 — large positive numbers become almost 1, large negatives become almost 0, and 0 lands at 0.5. It’s a historically important activation function (it was the standard before ReLU) and has two practical advantages for our XOR example: its output is directly usable as binary classification (above 0.5 = class 1, below 0.5 = class 0), and its derivative can be computed particularly easily from its own output. We’ll see what that means in the code.
Sigmoid function
Sigmoid squashes any input to the range (0, 1). Near the ends, the curve is almost flat — gradients nearly zero.
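The claim that sigmoid’s derivative can be computed from its own output — σ'(z) = σ(z)·(1 − σ(z)) — is easy to verify numerically. A quick sketch (the grid of test points and the step size are our own choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
a = sigmoid(z)

from_output = a * (1 - a)  # the shortcut: derivative computed from the output a

# Compare against the definition of the derivative as a difference quotient
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(np.max(np.abs(from_output - numerical)))  # tiny: identical for practical purposes
```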
Weight initialization
Before training starts, the weights need to be set to some values. One obvious idea would be: all to zero. That doesn’t work — if all weights are identical, all neurons in a layer get exactly the same gradient and learn the same thing. The neurons never become distinguishable, and the network behaves like a single neuron.
That’s why we initialize with random numbers. Each weight gets a small random value. This way every neuron starts at a slightly different position, and backpropagation can modify them differently. The factor * 0.5 in the code below scales the random values to a small range — typical values are between -0.5 and +0.5. Too-large starting values cause sigmoid to enter regions where its derivative is near 0 (at the outer ends), and training stalls.
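The symmetry problem can be demonstrated directly. In this sketch we initialize all weights to the same constant (0.3 is our own arbitrary pick) and run one forward/backward pass with the same formulas as the training loop below — every column of the W1 gradient, i.e. every hidden neuron, ends up with exactly the same value:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Identical initialization (constant 0.3) instead of random numbers
W1 = np.full((2, 4), 0.3); b1 = np.zeros((1, 4))
W2 = np.full((4, 1), 0.3); b2 = np.zeros((1, 1))

# One forward/backward pass, same formulas as in the training loop
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
d_z2 = (2 * (a2 - y) / len(X)) * a2 * (1 - a2)
d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
d_W1 = X.T @ d_z1

print(d_W1)  # all four columns are identical - the neurons can never diverge
```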
Weights and biases — two kinds of parameters
So far we’ve only talked about weights. But neural networks have a second kind of parameter that’s just as important: biases. Both are learned during training, both get their own gradients during the backward pass.
The computation for a layer is z = x @ W + b. The weights part x @ W connects the neurons — for each connection between an input neuron and a neuron in the next layer there’s one weight. The bias b is a shift per neuron, independent of the input. Each neuron in the receiving layer has its own bias.
Why do we need biases? Without biases, every layer would have to pass through the origin — with input 0 the output before activation would also always be 0. The bias allows the neuron to say: “I only activate when the sum is greater than X”. This makes the network more flexible.
For our XOR network, let’s add up the parameters:
- W1 (2 inputs × 4 hidden): 8 weights
- b1 (one bias per hidden neuron): 4 biases
- W2 (4 hidden × 1 output): 4 weights
- b2 (one bias for the output neuron): 1 bias
Total: 8 + 4 + 4 + 1 = 17 parameters. All 17 are learned during training.
In the code we’ll see this as four separate arrays: W1, b1, W2, b2. The biases are initialized to 0 with np.zeros(...) (this is unproblematic since the weights are already random — all neurons start out differently anyway).
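The count can be confirmed directly from the arrays’ shapes, using the same initialization as the training code below:

```python
import numpy as np

np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5  # 8 weights
b1 = np.zeros((1, 4))             # 4 biases
W2 = np.random.randn(4, 1) * 0.5  # 4 weights
b2 = np.zeros((1, 1))             # 1 bias

total = sum(p.size for p in [W1, b1, W2, b2])
print(total)  # 17
```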
Network architecture: 2 → 4 → 1
2 inputs, 4 hidden neurons, 1 output. W1 (2×4 = 8 weights) and W2 (4×1 = 4 weights), plus 4+1 = 5 biases — 17 parameters in total.
```python
import numpy as np

def sigmoid(x):
    # Clip prevents overflow for very large negative values
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_deriv(a):
    # The neat thing about sigmoid: the derivative can be computed from the output a itself.
    # If a = sigmoid(z), then sigmoid'(z) = a * (1 - a).
    # This saves us the detour via z in the backward pass.
    return a * (1 - a)

# XOR dataset: the four possible binary inputs and their target outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Network architecture: 2 inputs -> 4 hidden neurons -> 1 output
# We set a fixed seed so the experiment is reproducible.
np.random.seed(42)

# W1 connects the 2 inputs to 4 hidden neurons -> matrix with shape (2, 4).
# randn() returns random numbers from a normal distribution centered on 0.
# The factor 0.5 scales them to a small range so sigmoid doesn't immediately
# land in the flat ends.
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))  # Bias per neuron, starts at 0

# W2 connects the 4 hidden neurons to the single output -> shape (4, 1).
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))

# Learning rate: how big are the steps in the weight update?
learning_rate = 0.5
losses = []  # We collect the loss per epoch for later analysis

for epoch in range(5000):
    # ───── Forward pass: send input through the network ─────
    z1 = X @ W1 + b1   # Linear combination in the hidden layer
    a1 = sigmoid(z1)   # Activation: a1 is the output of the hidden layer
    z2 = a1 @ W2 + b2  # Linear combination in the output layer
    a2 = sigmoid(z2)   # Activation: a2 is the network's final prediction

    # Loss: how far is the prediction a2 from the target y?
    # Mean Squared Error (MSE) is simple for regression-like problems:
    # mean squared deviation over all 4 training examples.
    loss = np.mean((a2 - y) ** 2)
    losses.append(loss)

    # ───── Backward pass: chain rule backward through the network ─────
    # Each line here corresponds to one step of the chain rule.
    # We start from the back (at the loss) and work our way forward.

    # Gradient of the loss with respect to the output a2.
    # MSE derivative: d/da2 of (a2-y)² is 2*(a2-y). Divide by len(X) because of the mean.
    d_a2 = 2 * (a2 - y) / len(X)

    # Gradient before the sigmoid in the output layer.
    # Chain rule: dL/dz2 = dL/da2 * da2/dz2. The latter is sigmoid_deriv(a2).
    d_z2 = d_a2 * sigmoid_deriv(a2)

    # Gradient of the weights W2 (for the update).
    # Because z2 = a1 @ W2 + b2, dz2/dW2 equals a1.
    # In matrix form: a1.T @ d_z2.
    d_W2 = a1.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)  # Bias gradient: just sum up

    # Gradient through W2 back into the hidden layer.
    # Chain rule: dL/da1 = dL/dz2 * dz2/da1 = d_z2 @ W2.T
    d_a1 = d_z2 @ W2.T

    # Gradient before the sigmoid in the hidden layer — same trick as in layer 2.
    d_z1 = d_a1 * sigmoid_deriv(a1)

    # Gradient of the weights W1 and bias b1.
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # ───── Update: push weights in the direction of steepest descent ─────
    # This is gradient descent: w = w - lr * gradient
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2

    # Print progress
    if epoch % 1000 == 0:
        preds = (a2 > 0.5).astype(int).flatten()
        correct = (preds == y.flatten()).sum()
        print(f"Epoch {epoch:4d}  Loss = {loss:.4f}  Correct = {correct}/4")

print("\nFinal predictions:")
for inp, target, pred in zip(X, y.flatten(), a2.flatten()):
    print(f"  {inp} → {pred:.3f}  (Target: {int(target)})")
```

```
Epoch    0  Loss = 0.2611  Correct = 2/4
Epoch 1000  Loss = 0.1891  Correct = 2/4
Epoch 2000  Loss = 0.0432  Correct = 4/4
Epoch 3000  Loss = 0.0084  Correct = 4/4
Epoch 4000  Loss = 0.0041  Correct = 4/4

Final predictions:
  [0. 0.] → 0.061  (Target: 0)
  [0. 1.] → 0.940  (Target: 1)
  [1. 0.] → 0.942  (Target: 1)
  [1. 1.] → 0.063  (Target: 0)
```
The network has learned XOR. It starts at loss 0.26, sits on a kind of plateau for about 1500 epochs (it doesn’t know where to go), then suddenly tips toward the solution. After 2000 epochs it correctly classifies all four points, after 5000 the predictions are nicely at almost 0 or almost 1.
Let’s look at the backward pass closely. That’s the entire backpropagation algorithm:
- We start at the back at the loss and compute the gradient with respect to the output.
- We go backward through each layer. At each layer: we use the gradient of the next layer and the local gradient of the current layer to compute the gradient for the weights.
- We update all weights with gradient descent.
That’s it. No magic. Just matrix multiplications and derivatives that we assemble from the chain.
Where do the derivatives in the code come from?
Each line in the backward pass corresponds to exactly one chain-rule step. Let’s go through them:
d_a2 = 2 * (a2 - y) / len(X)
That’s dL/da2. Mean Squared Error is L = Σ(a2-y)²/n. Derivative with respect to a2: 2(a2-y)/n.
d_z2 = d_a2 * sigmoid_deriv(a2)
That’s dL/dz2 = dL/da2 · da2/dz2. Because a2 = sigmoid(z2), da2/dz2 = sigmoid'(z2) = a2·(1-a2). Chain rule.
d_W2 = a1.T @ d_z2
That’s dL/dW2. Because z2 = a1 · W2 + b2, the derivative with respect to W2 is simply a1. In matrix form: a1.T @ d_z2.
d_a1 = d_z2 @ W2.T
Gradient goes backward through the linear layer. Because z2 = a1 · W2, dz2/da1 = W2. So dL/da1 = dL/dz2 · W2.T.
d_z1 = d_a1 * sigmoid_deriv(a1)
Chain rule through the sigmoid activation again, just like in layer 2.
d_W1 = X.T @ d_z1
Gradient of the weights in layer 1 — same logic as for d_W2, just with X instead of a1 as input.
Each line is one local derivative, times the gradient that has already flowed back from the later layers. That’s exactly the chain rule.
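A standard sanity check for hand-written backprop is to compare one of these gradients against a numerical difference quotient. A sketch for d_W2 of the XOR network (one forward/backward pass with the same formulas as above; the perturbation size h is our own choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5; b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.5; b2 = np.zeros((1, 1))

def loss(W2_val):
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2_val + b2)
    return np.mean((a2 - y) ** 2)

# Backprop gradient for W2, exactly as in the training loop
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
d_z2 = (2 * (a2 - y) / len(X)) * a2 * (1 - a2)
d_W2 = a1.T @ d_z2

# Numerical gradient: nudge each entry of W2 and measure the loss change
h = 1e-6
numerical = np.zeros_like(W2)
for i in range(W2.shape[0]):
    Wp, Wm = W2.copy(), W2.copy()
    Wp[i, 0] += h
    Wm[i, 0] -= h
    numerical[i, 0] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(d_W2 - numerical)))  # tiny: the hand-derived backprop is correct
```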
From XOR to Token Prediction
XOR is a clean demo, but artificial. A language model doesn’t learn XOR, it learns token predictions. The principle is identical, just bigger and with cross-entropy instead of MSE.
We train a tiny MLP to produce the right output token given an input token. Our vocabulary consists of five tokens: ["The", "capital", "of", "France", "Paris"]. We show the model four example pairs:
"The" → "capital"
"capital" → "of"
"of" → "France"
"France" → "Paris"
The model should learn these associations. The setup is the same as in Article 3, just with backpropagation attached now. A few details are different and therefore commented:
- One-hot encoding instead of embeddings: we represent each token as a vector with a 1 at exactly one position, 0 elsewhere. This is the simplest conceivable input and works for our 5 tokens. In real models, embedding vectors are used here — that doesn’t change anything about the training principle.
- Softmax + cross-entropy instead of sigmoid + MSE: for multi-class classification this is the standard. The derivative of this combination is particularly elegant, as we’ll see in the code.
```python
import numpy as np

# Our mini vocabulary and its size V.
# V = 5 means: at the end the model will output, for each input, a
# probability distribution over these 5 tokens.
VOCAB = ["The", "capital", "of", "France", "Paris"]
V = len(VOCAB)

# Training data as token IDs (the position in the vocabulary).
# X_ids are the inputs, y_ids the corresponding target outputs.
# Example: (0, 1) means "model should predict, given token 0 ('The'),
# the next token 1 ('capital')".
X_ids = np.array([0, 1, 2, 3])  # "The", "capital", "of", "France"
y_ids = np.array([1, 2, 3, 4])  # "capital", "of", "France", "Paris"

# One-hot encoding of the inputs. A token ID becomes a vector of length V,
# with a 1 at the position of the ID, 0 elsewhere.
# Trick: np.eye(V) is the V×V identity matrix (diagonal = 1, rest = 0).
# Its i-th row is already the one-hot vector for token i.
# [X_ids] selects the rows in the order of our token IDs.
# Result X has shape (4, 5) - 4 training examples, each a 5-dim vector.
X = np.eye(V)[X_ids]

def softmax(z):
    # Softmax turns arbitrary numbers (the "logits") into a
    # probability distribution - all values between 0 and 1,
    # and the sum over each row is exactly 1.
    #
    # Stability trick: if z contains very large values (e.g. 1000),
    # then np.exp(1000) would overflow and return Infinity.
    # Subtracting the maximum beforehand leaves the result mathematically
    # identical (it cancels out) but avoids the overflow.
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    # Division by the sum per row normalizes the values to probabilities.
    return e / e.sum(axis=1, keepdims=True)

def gelu(x):
    # GELU - the activation function from Article 3, standard in modern transformers.
    # We use it here to demonstrate that backpropagation works with any
    # differentiable activation, not just sigmoid.
    # The formula is a smooth approximation - you don't need to understand it,
    # just know: GELU behaves similarly to ReLU, but without the sharp kink at 0.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

def gelu_deriv(x):
    # For the chain rule in the backward pass we need the derivative of GELU.
    # For sigmoid it was analytically easy (a·(1-a)). For GELU it isn't.
    # We help ourselves with a *numerical* derivative: we simply compute
    #     (f(x+h) - f(x-h)) / (2h)
    # for a very small h. That's the definition of the derivative as a difference quotient.
    # For teaching purposes this is enough; in production code you'd take the analytical form
    # (or let frameworks like PyTorch do it automatically).
    h = 1e-4
    return (gelu(x + h) - gelu(x - h)) / (2 * h)

# Network architecture: V inputs (=5) -> 16 hidden neurons -> V outputs (=5).
# That input and output dimension are equal is due to our task:
# we take in a one-hot vector and want a probability distribution
# over the same vocabulary out.
# 16 hidden neurons is chosen arbitrarily - enough capacity for 4 training examples.
np.random.seed(0)
W1 = np.random.randn(V, 16) * 0.3  # Keep weights small, otherwise GELU starts in saturation
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, V) * 0.3
b2 = np.zeros((1, V))

learning_rate = 0.3

for epoch in range(500):
    # ───── Forward pass ─────
    z1 = X @ W1 + b1     # Hidden layer: linear combination, shape (4, 16)
    a1 = gelu(z1)        # Hidden layer activation, shape (4, 16)
    z2 = a1 @ W2 + b2    # Output layer: linear combination, shape (4, 5).
                         # These raw numbers are called "logits" - they can
                         # take any value, including negative ones.
    probs = softmax(z2)  # Softmax turns logits into probabilities, shape (4, 5)

    # Cross-entropy loss: -log(probability the model assigns to the *correct* token).
    # We need, for each of the 4 rows in probs, exactly the value at column y_ids[i].
    #
    # probs[np.arange(len(y_ids)), y_ids] is NumPy "fancy indexing":
    #   - np.arange(4) = [0, 1, 2, 3] selects all 4 rows
    #   - y_ids = [1, 2, 3, 4] selects for each row the matching column
    # Result: a vector with 4 probabilities - for each example the probability
    # the model assigned to the *correct* target token.
    #
    # The + 1e-12 avoids log(0) if the model outputs a probability of exactly 0.
    # .mean() takes the mean over the 4 training examples.
    loss = -np.log(probs[np.arange(len(y_ids)), y_ids] + 1e-12).mean()

    # ───── Backward pass ─────
    # The neat thing: the derivative of softmax + cross-entropy is extremely simple:
    # just (probs - y_onehot). This saves us all the intermediate steps
    # that would otherwise be necessary (the softmax derivative itself isn't that nice).
    # This is no coincidence but the reason softmax + cross-entropy is used everywhere
    # where classification is done.
    y_onehot = np.eye(V)[y_ids]         # y_ids as one-hot matrix, shape (4, 5)
    d_z2 = (probs - y_onehot) / len(X)  # Gradient at the output layer, shape (4, 5)

    # From here on identical logic as in the XOR example:
    d_W2 = a1.T @ d_z2  # Gradient of weights W2
    d_b2 = d_z2.sum(axis=0, keepdims=True)

    # Gradient backward through W2 into the hidden layer
    d_a1 = d_z2 @ W2.T            # Chain rule
    d_z1 = d_a1 * gelu_deriv(z1)  # Important: GELU derivative takes z1 as input
                                  # (not a1 - GELU is not, like sigmoid,
                                  # 'computable from the output')
    d_W1 = X.T @ d_z1             # Gradient of weights W1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # Update all parameters - step in the descent direction of the loss
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2

    if epoch % 100 == 0:
        preds = probs.argmax(axis=1)  # argmax: which column has the highest probability?
        correct = (preds == y_ids).sum()
        print(f"Epoch {epoch:3d}  Loss = {loss:.4f}  Correct = {correct}/4")

# Test the trained model with each of the first 4 tokens as input.
print("\nPredictions after training:")
for i, token in enumerate(VOCAB[:-1]):
    # np.eye(V)[i:i+1] - using slicing [i:i+1] preserves the row dimension (shape (1, 5)),
    # whereas np.eye(V)[i] would return a 1D vector (shape (5,)).
    # But the matrix multiplication needs 2D shape as input.
    x = np.eye(V)[i:i+1]
    # Forward pass manually (no more backprop needed, just prediction):
    p = softmax(gelu(x @ W1 + b1) @ W2 + b2)[0]
    top = p.argmax()  # Index of the most likely token
    print(f"  '{token}' → '{VOCAB[top]}' ({p[top]*100:.1f}%)")
```

```
Epoch   0  Loss = 1.6196  Correct = 1/4
Epoch 100  Loss = 0.0158  Correct = 4/4
Epoch 200  Loss = 0.0064  Correct = 4/4
Epoch 300  Loss = 0.0039  Correct = 4/4
Epoch 400  Loss = 0.0028  Correct = 4/4

Predictions after training:
  'The' → 'capital' (99.7%)
  'capital' → 'of' (99.6%)
  'of' → 'France' (99.4%)
  'France' → 'Paris' (99.4%)
```
That’s, in miniature, exactly what a language model does. We take an input token, send it through a network, produce a probability distribution over all possible next tokens, compare with the actual next token from the training text, compute the loss, propagate backward, update the weights. And repeat — billions of times.
The similarity between the two examples is no coincidence. The backward pass looks almost identical, only the derivative at the end differs (MSE for XOR, cross-entropy for token prediction). That’s one of the reasons neural networks are so powerful: the same basic scaffolding works for completely different problems. You change input data, output dimension, and loss function — the machinery underneath stays identical.
What Else Comes in Practice
What we’ve built is vanilla gradient descent — the basic version. In real training setups, several refinements are added, all aimed at the same thing: getting to the minimum faster and more stably.
Mini-batches. Above we used all four data points together to compute one gradient. With real datasets of billions of tokens this doesn’t work — a batch would be too big for memory. Instead you compute the gradient on a small subset (e.g. 64 or 256 examples), do an update, take the next subset. That’s Stochastic Gradient Descent (SGD) or mini-batch SGD.
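The batching pattern can be sketched on a toy problem. Here we fit a single weight of a linear model y ≈ w·x by mini-batch SGD; the dataset, batch size, and learning rate are our own illustration, not from the article:

```python
import numpy as np

# Toy data: 1000 points from y = 3*x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3 * X + rng.normal(scale=0.1, size=(1000, 1))

w = np.zeros((1, 1))
learning_rate = 0.1
batch_size = 64

for epoch in range(20):
    perm = rng.permutation(len(X))  # new random order each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]  # one mini-batch
        # MSE gradient computed on this batch only, not the full dataset
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
        w -= learning_rate * grad  # update after every batch

print(w)  # close to 3.0
```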
Momentum. Instead of using only the current gradient, you keep a moving average of recent gradients. This smooths training and helps cross narrow valleys.
Adam. The widely used optimizer today. Combines momentum with a per-parameter adaptive learning rate. Parameters that frequently see large gradients get smaller steps; parameters with rarely large gradients get bigger ones. This makes training robust even with very differently scaled parameters.
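Momentum and Adam only change the update step. A sketch of the Adam update for a single parameter, with the standard default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the toy loss L(w) = (w − 3)² is our own:

```python
import numpy as np

w = 0.0
learning_rate = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # moving averages of gradient and squared gradient

for t in range(1, 501):
    grad = 2 * (w - 3)                       # dL/dw for L(w) = (w - 3)**2
    m = beta1 * m + (1 - beta1) * grad       # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2  # scale: smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for the startup phase
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # near 3.0, the minimum
```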
Learning rate schedules. The learning rate isn’t kept constant but varied according to a schedule — typically high at the start, then gradually decaying. You want to make fast progress early on, then fine-tune at the end.
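A common schedule shape is linear warmup followed by cosine decay. A sketch of the schedule function itself (the specific numbers — peak rate, warmup steps, total steps — are our own placeholders):

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_schedule(50))    # halfway through warmup: half the peak rate
print(lr_schedule(100))   # peak
print(lr_schedule(1000))  # end of training: ~0
```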
None of this changes the basic principle. The core remains: compute forward pass, determine loss, backward pass for all gradients, update weights. What’s added are tricks to make this core more efficient.
What’s Still Missing
We now have everything to train a network. Embedding, forward pass, loss, backpropagation, update — the full loop is there. With this toolkit we could in principle build a language model.
But the model we’ve built so far has a fundamental problem: it only sees one token at a time. The input is one embedding vector, the output is a prediction for the next token. The context — all the tokens that came before — is not part of the input.
Language, however, works exactly through context. “I went to the bank” and “the bank of the river” mean different things, and the difference arises entirely from the surrounding tokens. A model that sees only one token at a time cannot capture such context dependencies.
The next question is therefore: how do you put context into the model? The first answer the deep-learning community came up with was recurrent neural networks (RNNs) — networks with an internal state that updates with each token. That’s where Article 5 begins.
All Articles in the Series
- The Next Word — How Language Models Work
- Words as Points in Space — What Embeddings Really Are
- Neural Networks from Scratch
- Backpropagation — How a Model Learns ← this article
- Context and RNNs — Why Order Matters (coming soon)
- Attention — The Mechanism That Changed Everything (coming soon)
- The Transformer — The Complete Architecture (coming soon)
- Fine-Tuning — From Base Model to Assistant (coming soon)
Series: How LLMs Work · rotecodefraktion.de
Translated from the German original with the help of Claude.