Neural Networks from Scratch

Article 3 of 8 · Series: How LLMs Work

We uncovered a problem in Article 2 that sounds more elegant than it is.

Embeddings are static. “Bank” has one vector, whether it means a park bench or a financial institution (the German word covers both meanings). The model sees this vector and has to somehow, somewhere, figure out from context what is meant. How does it do that? What happens between the moment an embedding vector enters the model and the moment logits come out?

The answer is neural networks. Not as a thought experiment adapted from nature, but as what they mathematically are: matrix multiplications, addition, a nonlinear function, applied repeatedly. We build this from scratch in this article, again without PyTorch or any framework, just numpy, so it becomes clear what happens behind the scenes. At this point, we can no longer avoid some math.

The Problem with Linear Transformations

Before we build a neuron, we should understand why we need one.

An embedding vector is a list of numbers. You could try to multiply it directly with a matrix to produce logits. That would be a linear transformation: logits = W · embedding + b. Simple, fast, differentiable.

The problem: any number of linear transformations chained together is still a single linear transformation. W₂ · (W₁ · x + b₁) + b₂ can be collapsed into (W₂W₁) · x + (W₂b₁ + b₂) — one matrix and one bias. No matter how many layers we stack, the expressive power remains the same as a single layer.
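
A quick numerical check of this claim (a minimal sketch added here, not part of the original article): two random linear layers applied in sequence give exactly the same output as the single collapsed layer.

import numpy as np

rng = np.random.default_rng(0)
x      = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two linear layers applied in sequence ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... collapse into one matrix and one bias
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True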

A linear model can only learn linear relationships. But language is not linear. The meaning of “not” depends on what comes after it. “He didn’t win” and “He didn’t lose” — the word “not” flips the meaning, but how it does so is context-dependent.

What we need is nonlinearity — a function that enables the model to learn more complex relationships. That is the job of the activation function.

Why Nonlinearity?

The decisions a language model has to make rarely follow a simple sum. Imagine a machine that needs to distinguish dogs from cats in photos. No single feature is sufficient — neither ear shape, nor size, nor leg length. The decision emerges from the combination of multiple features, in a way that a plain weighted sum cannot capture: pointed ears plus small plus long legs suggests cat, pointed ears plus large suggests German Shepherd, floppy ears plus small suggests Dachshund.

A linear function fundamentally cannot solve this. It is literally just a weighted sum. And if we stack multiple linear layers on top of each other, the whole thing can be collapsed into a single sum — mathematically equivalent to a single layer. No matter how deep we build: the result is always a straight decision boundary, a plane, a hyperplane.

The XOR problem is the simplest proof of this. Four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. The ones lie diagonally, as do the zeros. There is no straight line that cleanly separates them. Every linear classifier fails on at least one point — we will see this concretely later.
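
As a quick sanity check (a small sketch added here, not from the original article): an exhaustive search over a grid of weights and biases for a single linear threshold unit never gets all four XOR points right; the best it reaches is three out of four.

import numpy as np
from itertools import product

# XOR: four points and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Brute force over a grid of linear classifiers: predict 1 if w·x + b > 0
grid = np.linspace(-2, 2, 9)
best = 0
for w1, w2, b in product(grid, grid, grid):
    pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best = max(best, int((pred == y).sum()))

print(f"Best linear classifier: {best} of 4 points correct")   # 3 of 4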

ReLU — the simplest conceivable nonlinearity, just a kink at 0 — is enough to change this. The kink allows the network to split the input into two differently treated regions and compute differently in each region. This enables multiple layers to truly build on each other, instead of collapsing into one.

A Single Neuron

A neuron is the smallest unit. It takes multiple inputs, weights them, adds a bias, and applies an activation function.

import numpy as np

def relu(x):
    return np.maximum(0, x)

# A neuron: 3 inputs, 1 output
np.random.seed(42)
inputs  = np.array([0.5, -0.3, 0.8])   # Embedding values (3 dimensions)
weights = np.array([0.4, -0.7, 0.2])   # Learnable weights
bias    = 0.1                           # Learnable bias

# Forward pass
z = np.dot(inputs, weights) + bias     # Linear combination
a = relu(z)                            # Activation function

print(f"Inputs:    {inputs}")
print(f"Weights:   {weights}")
print(f"Bias:      {bias}")
print(f"z = w·x + b = {z:.4f}")
print(f"a = ReLU(z) = {a:.4f}")
Inputs:    [ 0.5 -0.3  0.8]
Weights:   [ 0.4 -0.7  0.2]
Bias:      0.1
z = w·x + b = 0.6700
a = ReLU(z) = 0.6700

Neuron Forward Pass — Interactive

[Interactive diagram: the three inputs x₁, x₂, x₃ are multiplied by their weights w₁, w₂, w₃, summed together with the bias to z = 0.67, and passed through ReLU (max(0, z)) to give a = 0.67.]

z is the weighted sum — the linear combination of the inputs. a is the output after the activation function. ReLU is radically simple: everything below 0 becomes 0, everything above 0 stays unchanged. max(0, x).

The bias is an additional parameter added independently of the inputs. Think of it as a threshold shift: without bias, the neuron would always fire through the origin — with input [0, 0, 0] the output would necessarily be 0. The bias allows the neuron to have a base activation even without input, or to shift the point where ReLU kicks in. Without bias, the network’s expressive capabilities would be significantly limited.
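
A tiny illustration of the threshold-shift idea (a sketch added here, not from the original): with an all-zero input the weights contribute nothing, so the output is determined entirely by the bias, and a negative bias raises the threshold a positive input has to clear.

import numpy as np

def relu(x):
    return np.maximum(0, x)

weights    = np.array([0.4, -0.7, 0.2])
zero_input = np.zeros(3)
inputs     = np.array([0.5, -0.3, 0.8])

print(relu(zero_input @ weights + 0.0))   # 0.0: without a bias, the neuron is silent at zero input
print(relu(zero_input @ weights + 0.1))   # 0.1: the bias provides a base activation
print(relu(inputs @ weights - 1.0))       # 0.0: a negative bias raises the firing threshold (0.57 - 1.0 < 0)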

The weights and the bias are the learnable parameters of this neuron. During training, they are adjusted so the model makes better predictions — we will look at how that works in Article 4.

A Complete Layer

A single neuron is useless. It becomes useful when many neurons operate in parallel and their outputs are combined — that is a layer.

import numpy as np

def relu(x):
    return np.maximum(0, x)

np.random.seed(42)

# Layer with 3 inputs and 4 neurons
inputs      = np.array([0.5, -0.3, 0.8])   # shape: (3,)
weights     = np.random.randn(3, 4) * 0.5  # shape: (3, 4) — one column per neuron
biases      = np.zeros(4)                   # shape: (4,)

# Forward pass for all 4 neurons simultaneously
z = inputs @ weights + biases              # shape: (4,)
a = relu(z)                                # shape: (4,)

print(f"Input shape:   {inputs.shape}")
print(f"Weights shape: {weights.shape}")
print(f"Output shape:  {a.shape}")
print(f"\nBefore ReLU (z): {z.round(4)}")
print(f"After ReLU  (a): {a.round(4)}")
Input shape:   (3,)
Weights shape: (3, 4)
Output shape:  (4,)

Before ReLU (z): [-0.0285  0.2176 -0.2603  0.0794]
After ReLU  (a): [0.     0.2176 0.     0.0794]

The weight matrix has shape (3, 4): 3 inputs, 4 neurons. The matrix multiplication inputs @ weights computes all 4 neurons simultaneously — this is why neural networks run so efficiently on GPUs. GPUs are fundamentally machines that parallelize matrix multiplications.

Two of the four neurons output 0 — their linear combination was negative and ReLU clipped them. This is not a problem, it is the principle: neurons “fire” only when their input exceeds a threshold. The sparsity is a useful property.
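
To make the “all neurons simultaneously” point concrete, here is a small check (a sketch added here, not from the original): looping over the four neurons one at a time, one column of the weight matrix per neuron, gives exactly the same result as the single matrix multiplication.

import numpy as np

rng     = np.random.default_rng(42)
inputs  = np.array([0.5, -0.3, 0.8])
weights = rng.normal(size=(3, 4)) * 0.5
biases  = np.zeros(4)

# One neuron at a time: each column of the weight matrix belongs to one neuron
per_neuron = np.array([inputs @ weights[:, j] + biases[j] for j in range(4)])

# All four neurons at once, as a single matrix multiplication
all_at_once = inputs @ weights + biases

print(np.allclose(per_neuron, all_at_once))   # True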

Activation Functions: ReLU and GELU

ReLU is historically important and still widely used. Transformers — the architectural foundation behind GPT, Claude, and Gemini — typically use a variant called GELU (Gaussian Error Linear Unit).

import numpy as np

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    # Approximation used in practice
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Comparison for some values
print(f"{'x':6s}  {'ReLU(x)':10s}  {'GELU(x)':10s}")
print("-" * 30)
for x in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"{x:6.1f}  {relu(x):10.4f}  {gelu(x):10.4f}")
x       ReLU(x)     GELU(x)
------------------------------
  -2.0      0.0000     -0.0455
  -1.0      0.0000     -0.1588
  -0.5      0.0000     -0.1543
   0.0      0.0000      0.0000
   0.5      0.5000      0.3457
   1.0      1.0000      0.8412
   2.0      2.0000      1.9545

The difference is subtle but important. ReLU cuts off hard at 0 — to the left the derivative is exactly 0, to the right exactly 1. This leads to the so-called “dying ReLU” problem: neurons whose inputs are pushed far into the negative region receive a gradient of exactly 0 and stop learning.

GELU behaves more smoothly. Slightly negative values are not completely zeroed out, but only dampened. The transition at 0 is smooth and differentiable. In practice, Transformers with GELU often train better and more stably than with ReLU — which is why GELU is the standard in modern architecture implementations.
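
A small numerical illustration of that difference (a sketch added here, not part of the original article): the finite-difference slope of ReLU at a negative input is exactly 0, while GELU's slope there is small but nonzero, so a GELU neuron in the negative region still receives a weak learning signal.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def slope(f, x, eps=1e-5):
    # Numerical derivative via central differences
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [-2.0, -1.0, 1.0]:
    print(f"x = {x:5.1f}   ReLU' ≈ {slope(relu, x):7.4f}   GELU' ≈ {slope(gelu, x):7.4f}")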

Activation Functions — Interactive

[Interactive plot: ReLU and GELU curves f(x) for x between -3 and 3, with an adjustable x value.]

The MLP: Multiple Layers Stacked

A single layer is not yet a model. What it can learn is limited: it takes a weighted sum of its inputs, applies an activation function once — that’s it. For simple patterns that suffices, but not for language.

An MLP (Multilayer Perceptron) stacks multiple layers on top of each other, with an activation function in between. This is the crucial point: without activation in between, the layers could be collapsed into a single layer — we would have gained nothing. The activation function between layers breaks this collapsibility and allows the network to learn truly complex patterns.

The Transformer Convention

In a Transformer, the MLP module has a very specific, uniform structure:

  1. A hidden layer that expands the embedding dimension to four times its size (hence “wide hidden layer”). If the embedding has 64 dimensions, the hidden layer gets 256. If it has 4096 dimensions, the hidden layer gets 16,384.
  2. An activation function (GELU) applied to this wide intermediate space.
  3. An output layer that projects the dimension back to the original embedding format — or for token prediction, directly to the vocabulary size.

The “expand and contract” pattern is important: the wide intermediate space temporarily gives the network a lot of room to build intermediate representations — a kind of large working memory. The output layer compresses the result back to the compact representation the rest of the network expects.

The following code implements this completely. We introduce two helpers: a Linear class for a single complete layer, and the softmax function for the very last step.

import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

class Linear:
    def __init__(self, n_in, n_out, seed=None):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        return x @ self.W + self.b

class MLP:
    """Simple 2-layer MLP as found in a Transformer block."""
    def __init__(self, d_model, d_ff, d_out, seed=0):
        self.layer1 = Linear(d_model, d_ff,  seed=seed)
        self.layer2 = Linear(d_ff,    d_out, seed=seed+1)

    def forward(self, x):
        x = gelu(self.layer1.forward(x))   # Hidden layer with GELU
        x = self.layer2.forward(x)          # Output layer (no activation)
        return x

# Typical dimensions for a small Transformer:
# d_model = 64  (embedding dimension)
# d_ff    = 256 (4× d_model, classic Transformer convention)
# d_vocab = 10  (simplified mini vocabulary)

np.random.seed(0)
d_model, d_ff, d_vocab = 64, 256, 10

mlp = MLP(d_model=d_model, d_ff=d_ff, d_out=d_vocab, seed=0)

# An embedding vector (e.g., for token "is")
embedding = np.random.randn(d_model) * 0.1

# Forward pass
logits = mlp.forward(embedding)
probs  = softmax(logits)

print(f"Input (Embedding):  shape={embedding.shape}")
print(f"Hidden Layer:       shape=({d_ff},)  [after GELU]")
print(f"Output (Logits):    shape={logits.shape}")
print(f"\nLogits:  {logits.round(3)}")
print(f"Probs:   {probs.round(3)}")
print(f"Sum:     {probs.sum():.6f}")
print(f"\nTotal parameters: {d_model*d_ff + d_ff + d_ff*d_vocab + d_vocab:,}")
Input (Embedding):  shape=(64,)
Hidden Layer:       shape=(256,)  [after GELU]
Output (Logits):    shape=(10,)

Logits:  [ 0.048  0.111 -0.026  0.01  -0.063  0.035 -0.038 -0.012  0.047  0.066]
Probs:   [0.103 0.11  0.096 0.099 0.092 0.102 0.094 0.097 0.103 0.105]
Sum:     1.000000

Total parameters: 19,210

What Happens Step by Step

The code looks dense but only does four things, each of which we already built before:

1. The Linear class bundles what we did by hand in “A Complete Layer”: it holds a weight matrix W and a bias vector b, and its forward method computes x @ W + b — the weighted sum for all neurons simultaneously. No magic, just bookkeeping.

2. The MLP class combines two such Linear layers:

  • layer1: projects from d_model=64 to d_ff=256 neurons (the wide hidden layer)
  • layer2: projects from d_ff=256 back to d_out=10 (in this example, the vocabulary size)

In forward, GELU is applied only between the two layers — exactly the nonlinearity that keeps the two layers from collapsing into one. After layer 2, there is no activation, because the output should be raw logits.

3. The forward pass takes a single embedding vector with 64 numbers, expands it to 256 numbers, passes it through GELU, and compresses it back to 10 numbers. These 10 numbers are the logits — one score for each of the 10 possible tokens in the vocabulary.

4. softmax converts these 10 scores into 10 probabilities that sum to 1.0. This is the same function we already saw in Article 1 during token sampling.

Reading the Output

In the output we see that the probabilities all hover around 10% — nearly uniform. This is not a bug, it is by design: the network was initialized with random weights. It has never seen text and has learned nothing. All we are measuring is the noise of the random initialization.

That a complete forward pass works nonetheless — from embedding to probability distribution — is the point. The mechanics are complete. What is missing is training: the process that adjusts the weights so that the output probabilities become meaningful predictions. That is Article 4.

One more number worth remembering: 19,210 parameters for this mini-MLP. Those are all weights plus biases of both layers combined. Sounds like a lot for a network that cannot do anything — and is simultaneously laughably little compared to real models, as we will see shortly.
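
For clarity, here is the parameter accounting behind that number, just unpacking the formula from the print statement above:

d_model, d_ff, d_vocab = 64, 256, 10

layer1_params = d_model * d_ff + d_ff       # 16,384 weights + 256 biases = 16,640
layer2_params = d_ff * d_vocab + d_vocab    #  2,560 weights +  10 biases =  2,570
print(f"{layer1_params + layer2_params:,}") # 19,210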

Why Depth Works

A common misconception: more layers does not simply mean “more compute.” Depth enables hierarchical representations.

The classic example is the XOR problem — it cannot be solved with a linear model, but is trivial with two layers:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# XOR: [0,0]→0, [0,1]→1, [1,0]→1, [1,1]→0
# No linear classifier can solve this — the classes are not linearly separable.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-set network that solves XOR
# Layer 1 learns: "at least one is 1" (neuron 0) and "both are 1" (neuron 1)
W1 = np.array([[1, 1], [1, 1]], dtype=float)
b1 = np.array([0, -1], dtype=float)

# Layer 2 combines: "at least one" MINUS 2 × "both" = XOR
W2 = np.array([[1], [-2]], dtype=float)
b2 = np.array([0], dtype=float)

print("XOR with 2-layer MLP:")
print(f"{'Input':12s}  {'Expected':10s}  {'Output':10s}")
print("-" * 36)
for x, target in zip(X, y):
    h   = relu(x @ W1 + b1)
    out = h @ W2 + b2
    pred = 1 if out[0] > 0.5 else 0
    ok = "✓" if pred == target else "✗"
    print(f"{str(x):12s}  {target:10d}  {pred:10d}  {ok}")
XOR with 2-layer MLP:
Input         Expected     Output
------------------------------------
[0 0]                 0           0  ✓
[0 1]                 1           1  ✓
[1 0]                 1           1  ✓
[1 1]                 0           0  ✓

XOR Problem — Interactive

[Interactive plot: the four XOR points in the x₁–x₂ plane, labeled 0 and 1.]

The first layer learns two sub-concepts: “at least one of the inputs is 1” and “both inputs are 1.” The second layer combines these intermediate results into the final answer. No single layer could do this — the nonlinearity (ReLU) between the layers makes it possible.

In a real language model, the same thing happens at a different scale. The first layers learn simple patterns — morphology, common word pairs. Middle layers learn syntactic structures. Deep layers learn semantic relationships and world knowledge. This hierarchy emerges not through explicit design, but from the structure of the problem.

How Large Is a Real Model?

Our mini-MLP had 19,210 parameters. For comparison: Llama 3.1 405B — one of the largest publicly documented open-source models — has 405 billion parameters, distributed across 126 Transformer layers with an embedding dimension of 16,384 and an MLP hidden layer of 53,248. The MLP of a single layer alone has over 1.7 billion parameters — and stacked 126 times, that accounts for the bulk of the 405 billion.
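
A rough back-of-the-envelope version of that calculation, sketched with the simple two-matrix MLP structure from this article (the actual Llama MLP uses a gated variant, so the true per-layer count is somewhat higher):

d_model  = 16_384    # embedding dimension of Llama 3.1 405B
d_ff     = 53_248    # MLP hidden dimension
n_layers = 126

per_layer = d_model * d_ff + d_ff * d_model   # expand + contract, biases ignored
print(f"{per_layer:,}")                       # 1,744,830,464: over 1.7 billion per layer
print(f"{per_layer * n_layers:,}")            # roughly 220 billion across all 126 layers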

Proprietary flagship models like GPT-5 or Claude Opus are presumably similar in size or larger, but their exact architectural details are not public. What is clear: capability scales with model size, at least up to a point. More parameters means more capacity to store patterns.

Worth noting: the forward pass through all these layers follows exactly the same principle we built here. Matrix multiplication, activation function, matrix multiplication again. The architecture is simple — the scale is not.

What Is Still Missing

We can now send an embedding vector through an MLP and get logits. What we cannot do: let the network learn.

Our weights are randomly initialized. The logits that come out reflect nothing meaningful — uniformly distributed probabilities over the vocabulary, as we saw above. To make useful predictions, the weights must be adjusted.

This happens through backpropagation — the algorithm that propagates the model’s error back through all layers and computes for each weight in which direction it needs to change to improve predictions. The mathematical foundation is the chain rule from calculus — and that is the topic of Article 4.

All Articles in the Series

  1. The Next Word — How Language Models Work
  2. Words as Points in Space — What Embeddings Really Are
  3. Neural Networks from Scratch ← this article
  4. Backpropagation — How a Model Learns (coming soon)
  5. Context and RNNs — Why Order Matters (coming soon)
  6. Attention — The Mechanism That Changed Everything (coming soon)
  7. The Transformer — The Complete Architecture (coming soon)
  8. Fine-Tuning — From Base Model to Assistant (coming soon)

Series: How LLMs Work · rotecodefraktion.de


Translated from the German original with the help of Claude.