The Transformer, the Complete Architecture

Article 7 of 8 · Series: How LLMs Work

In Article 6 we met the mechanism that changed everything. Attention lets every token see every other token in the context directly, in a single matrix multiplication. No more notebooks, no more game of telephone.

But attention alone isn’t a language model. Three building blocks are missing, and without them the whole thing wouldn’t work. In this article we add them, assemble everything into a complete block, stack several into an architecture, and look at how this construction became GPT, BERT, Llama, and Claude.

Three gaps that attention doesn’t close

Before we build, let’s look once more at what’s missing in the mechanism from Article 6.

Position. Self-attention is permutation-invariant: shuffle the tokens of a sequence and you get the same output vectors, just in shuffled order. For mathematical operations that’s a beautiful property. For language it’s a disaster. Inside a self-attention layer, dog bites man and man bites dog produce exactly the same set of token representations. We have to teach the model that order matters.

Depth. A single attention layer does a single weighted mixing. That’s powerful but limited. Language has hierarchical structure — words form phrases, phrases form sentences, sentences form arguments. One layer can do one mixing. For hierarchical understanding we need several, stacked.

Stability. Deep stacks struggle with the vanishing gradients from Article 5. On the backward pass the gradient gets multiplied by a factor at every layer, and when those factors average below 1 it shrinks toward zero. Without countermeasures, a stack of twelve attention layers, let alone 96, will collapse during training.

Three problems, three solutions. Let’s start with position.

Position: tokens know their place

Self-attention receives a sequence of token embeddings. Each token compares itself to every other, weights the values accordingly, done. There’s no place where the model knows where a token sits in the sequence.

The solution is positional encodings. We give every embedding additional information: “you are token 1,” “you are token 2,” “you are token 17.” Concretely, we add a position vector onto every embedding.

def add_positional(embeddings, positions):
    # embeddings: (n, d_model)
    # positions:  (n, d_model)
    return embeddings + positions

One line. But the question is: where does the position vector come from?

Sinusoidal positional encoding

The original 2017 paper uses a closed form built from sine and cosine waves at different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Sounds complicated, but it’s elegant. Each dimension oscillates at a different frequency, low dimensions slowly, high dimensions quickly. Position 0 gets a different wave pattern than position 1, position 17 again a different one. The model can read the position out of these patterns.

The actual trick: because the encoding is a closed-form function of the position, there is no lookup table that could run out. A model trained on sequences up to length 512 can in theory still get sensible encodings at position 1000, because the waves simply continue.

Four sine waves of different frequencies, with three positions marked, each a unique combination of values

The diagram makes the principle visible: each dimension oscillates at its own frequency, from slow (DIM 0) to fast (DIM 3). At every position the model reads a value from each dimension, and the combination never repeats. Position 3, position 7, and position 11 therefore look distinguishable to the model, without any position ID being written down anywhere.
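
In code the closed form fits in a few lines. A minimal NumPy sketch (assuming an even d_model):

import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Sinusoidal positional encodings as in the 2017 paper (conceptual sketch)."""
    positions = np.arange(n_positions)[:, None]        # (n, 1)
    dims      = np.arange(0, d_model, 2)[None, :]      # (1, d/2): the 2i values
    angles    = positions / 10000 ** (dims / d_model)  # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return pe

The result plugs straight into add_positional from above: embeddings + sinusoidal_encoding(n, d_model).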

Learned positional embeddings

GPT-2 and BERT take a different route. Each position gets its own learnable embedding vector, just as tokens get learnable embedding vectors (Article 2). Position 1, position 2, … position 1024 — each one is its own entry in an embedding table. The model learns by itself what a given position means.

Advantage: more flexible. Disadvantage: hard-capped at the training length. Beyond the maximum position the model has no vectors and produces nonsense.
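
As a sketch, the learned variant is nothing more than a second embedding table, indexed by position instead of by token ID (position_table is a made-up name for illustration):

def learned_positional(embeddings, position_table):
    """position_table: (max_len, d_model), trained like any other weight matrix."""
    n = embeddings.shape[0]
    # beyond max_len there are no more rows and the addition breaks --
    # that is the hard cap mentioned above
    return embeddings + position_table[:n]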

RoPE and ALiBi, the modern variants

Llama, Claude, and most current models use RoPE (Rotary Position Embedding). Instead of adding a position vector, RoPE rotates the query and key vectors by a position-dependent angle. The difference between two positions is thereby built directly into the dot product of attention.

Three Q vectors at positions 0, 4, and 8, each rotated by a position-dependent angle in the Q/K space

Practical effect: relative positions are directly accessible to the model. A token at position 5 automatically knows that the token at position 7 is two steps to its right. In the dot product of Q and K, only the angle difference remains — exactly the relative distance — and the absolute position drops out. Works better for long contexts and extrapolates better beyond the training length.
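
What the rotation looks like, as a minimal sketch for a single query or key vector (one frequency per pair of dimensions, analogous to the sinusoidal frequencies above):

import numpy as np

def rope_rotate(x, position, base=10000):
    """Rotate a query or key vector pairwise by position-dependent angles (RoPE sketch)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per dimension pair
    angles = position * freqs                    # (d/2,)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                    # split into 2D pairs
    rotated = np.empty_like(x, dtype=float)
    rotated[0::2] = x1 * cos - x2 * sin          # standard 2D rotation per pair
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

Rotate q at position m and k at position n this way, and their dot product depends only on m - n, exactly the relative distance described above.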

ALiBi (Attention with Linear Biases) makes it even simpler: add a position-dependent bias directly to the attention scores, decreasing linearly with distance. No encoding vectors needed, hardly any extra parameters, very robust under extrapolation.
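
ALiBi in sketch form, for one attention head with a hypothetical slope value:

import numpy as np

def alibi_bias(n, slope=0.0625):
    """Distance penalty for one head: zero on the diagonal, more negative with distance."""
    pos = np.arange(n)
    distance = np.abs(pos[:, None] - pos[None, :])   # |i - j| for every query/key pair
    return -slope * distance                         # gets added to the raw attention scores

# scores = Q @ K.T / sqrt(d_k) + alibi_bias(n, slope), then softmax as usual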

Which method you choose depends on the model. For our conceptual understanding it’s enough that every token knows its position, somehow.

Feed-Forward: the forgotten third

Self-attention is only half the story of a transformer block. Right after attention comes a second component, mentioned almost casually in the original paper but actually responsible for the bulk of the model’s parameters: the Feed-Forward Network, FFN for short.

The structure is as simple as it gets. Two linear layers with an activation in between:

def feed_forward(x, W_1, W_2):
    """Two linear layers with an activation in between."""
    h = relu(x @ W_1)    # relu(z) = max(z, 0); (n, d_model) -> (n, d_ff)
    return h @ W_2       # (n, d_ff) -> (n, d_model)

The inner dimension d_ff is usually four times as large as d_model. With d_model = 768 the FFN works internally with 3072-dimensional vectors before projecting back to 768. Expand, transform, project back.

Important: the FFN is applied per token. Unlike attention, where tokens talk to each other, the FFN treats each token in isolation. It’s a pointwise operation, fully parallelizable. Token i sees nothing of token j inside the FFN.

Why is it even there?

At first glance the FFN looks trivial compared to attention. So why is it in every block?

Nonlinearity. Self-attention is essentially a weighted mixing of vectors — linear, except for the softmax. Without the FFN the network couldn’t learn complex, nonlinear functions. The ReLU (or GELU in modern models) brings in the nonlinear processing the network needs to actually construct new meaning out of a mixture.

Storage. An interesting finding from recent research: the FFN is probably where factual knowledge is stored. Attention searches and mixes, the FFN stores and transforms. In large models, roughly two-thirds of the parameters live in the FFN, one-third in attention. If you want to know where “Paris is the capital of France” sits inside the model, the FFN is the prime suspect.

But where exactly? “Prime suspect” isn’t the same as “crime scene,” and the honest answer is: we don’t know.

Studies like ROME (Meng et al. 2022) and MEMIT (2023) have shown that factual associations can be edited through targeted interventions in individual FFN layers. With a few tweaked weights, “Paris is the capital of France” becomes “Paris is the capital of Italy.” Works. But messily:

Before edit:
  "Paris is the capital of ___"          →  France
  "Where was Napoleon crowned?"          →  Notre-Dame in Paris
  "Do Parisians speak French?"           →  Yes

After ROME edit (Paris → Italy):
  "Paris is the capital of ___"          →  Italy      [as intended]
  "Where was Napoleon crowned?"          →  Rome       [leak]
  "Do Parisians speak Italian?"          →  Yes        [leak]

Edits leak onto related facts, generalize poorly, leave inconsistencies. Enough to say “the FFN is involved.” Not enough to say “the FFN is the storage location.”

The deeper reason lies in the architecture itself. Knowledge in the model is distributed, not localized: a single fact lives in a pattern across many neurons and usually across multiple layers. Conversely, a single neuron does not represent one concept but many simultaneously. Anthropic’s research on superposition (Elhage et al. 2022) explains why: models pack significantly more concepts than they have dimensions for, by overlaying meanings.

Superposition: five feature directions in two dimensions, vectors overlap

In two dimensions you can fit five conceptually distinct features, just not orthogonally. The vectors overlap and interfere with each other, yet the arrangement still works. Scaled to thousands of dimensions you get a dense overlay in which a single neuron simultaneously encodes tens of thousands of micro-concepts.

How concrete this gets is shown by Anthropic’s “Scaling Monosemanticity” (Templeton et al. 2024): using sparse autoencoders, the researchers extracted millions of interpretable features from the activations of Claude 3 Sonnet — from “Golden Gate Bridge” to “code bug” to “sycophancy.” Clamp the “Golden Gate Bridge” feature direction to its maximum value and the model can’t stop talking about the bridge in every reply. The feature is real, locatable, manipulable. But it doesn’t live in a single neuron — it lives as a direction in a high-dimensional activation space, found with the help of a specially trained auxiliary model.

The follow-up “On the Biology of a Large Language Model” (Anthropic, March 2025) extended the method to Claude 3.5 Haiku and used attribution graphs and circuit tracing to document how the model resolves multi-step operations internally — rhyme planning when writing poetry, multilingual reasoning via a cross-lingual representation, mathematical heuristics rather than symbolic rules. The methodology is now established across multiple model generations.

Even so: for the current frontier models of the 4.x and 5.x generations there is no public interpretability map of comparable depth. That’s not a coincidence but structural — interpretability research needs access to model internals, careful analysis, and time to publish. It systematically lags behind the release cadence of new models.

What remains are interventional clues and identified structures for older models, not plain-text maps for the current ones. We know that interventions in the FFN change factual recall. We know that coherent features can be extracted. We don’t know whether the knowledge is stored there or only retrieved — and for the models we actually work with today, we know only a fraction of the answers that have been published for their predecessors.

GLU and SwiGLU, the modern variants

Llama and several other modern models use variations with gates:

def swiglu_ff(x, W_gate, W_up, W_down):
    """SwiGLU FFN, as in Llama."""
    # silu(z) = z * sigmoid(z), a smooth ReLU variant
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

The * is elementwise multiplication, a gate that decides which parts get through. Empirically better than the original FFN, but conceptually the same scheme: expand first, project back.

Residual Connections, the direct wire

We now have two big operations per block: multi-head attention and feed-forward. Both substantially change the vector for every token. When we stack multiple blocks, we propagate these changes across many layers — and that’s where it gets critical.

From Article 5 we remember vanishing gradients. Deep networks can collapse during training because the gradient signal weakens on the backward pass, multiplied with every layer.

The solution comes from the ResNet world of computer vision in 2015 and is breathtakingly simple: residual connections. Instead of just passing on the output of an operation, we add the input back in:

output = x + sublayer(x)

One line. But conceptually the most important trick in the transformer.

The idea behind it: the model doesn’t learn “the new vector is X” but “the new vector is X plus a few adjustments.” If the adjustments would be harmful, the model can learn them down to zero, and the input flows through unchanged. That makes depth trainable. On the backward pass, the gradient has a direct path back through every residual connection without having to go through the sub-operation.

Comparison: a stack without residual connections (input only flows through every operation) vs. with residual connections (an additional direct bypass path to the output)

The visual difference is small, the effect dramatic: on the left every signal — activations on the forward pass, gradients on the backward pass — has to go all the way through every operation. On the right there’s an additional direct path. If the operations collapse or attenuate, the signal survives via the bypass. That’s the trick that makes stacks of 96 or 126 layers trainable in the first place.
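
A toy calculation makes the difference tangible. Assume, purely for illustration, that every layer damps the signal by 10 percent:

# without a bypass the signal is practically gone after 96 layers
print(0.9 ** 96)    # ~4e-05
# with residual connections the identity path carries the input through unchanged,
# so neither activations nor gradients can fall below what the direct path delivers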

In the transformer block, a residual connection comes after every sub-operation, one after attention and one after feed-forward.

Layer Normalization, the stabilizer

With residual connections we can stack deep. But deep stacks have another problem: values can explode or collapse. After 50 or 100 layers, normally distributed embeddings may have turned into vectors with gigantic or microscopic magnitudes. The next operation works on unstable values, and the training run derails.

Layer Normalization stabilizes this. Per token, the mean is set to 0 and the variance to 1:

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalizes per token, not per batch."""
    mean = x.mean(axis=-1, keepdims=True)
    var  = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

gamma and beta are learnable scale and shift parameters. The model can decide how strongly it wants to normalize and re-set the value range afterwards.

Important: LayerNorm normalizes per token across the feature dimension. That’s different from batch normalization (from computer vision), which normalizes per feature across the batch. LayerNorm has the advantage of being independent of batch size, which is essential for language models with variable sequence length.

Pre-LN vs Post-LN

In the original paper, LayerNorm comes after the sub-operation and the residual:

# Post-LN, as in the original paper
x = layer_norm(x + multi_head_attention(x))
x = layer_norm(x + feed_forward(x))

Modern models mostly use Pre-LN, where LayerNorm is applied before the sub-operation:

# Pre-LN, as in GPT-2 and all modern models
x = x + multi_head_attention(layer_norm(x))
x = x + feed_forward(layer_norm(x))

Empirically, Pre-LN trains much more stably for deep models. The main branch (x + ...) stays untouched, normalization happens only in the “side path” of the sub-operation. The direct residual path is clean, the gradient can flow back undisturbed.

RMSNorm, the leaner variant

Llama and several newer models use RMSNorm instead of LayerNorm. RMSNorm skips the mean subtraction and normalizes only by the root mean square:

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

Less computation, almost identical results. A typical example of the pragmatic simplifications that flow into modern architectures.

The complete block

Now we put everything together. A transformer block in the Pre-LN variant that has become the standard:

def transformer_block(x, attn, ffn, ln1, ln2):
    """A complete transformer block."""
    # Attention sublayer
    x = x + multi_head_attention(layer_norm(x, *ln1), *attn)
    # Feed-forward sublayer
    x = x + feed_forward(layer_norm(x, *ln2), *ffn)
    return x

Data flow through a transformer block: x goes through LayerNorm, multi-head attention, gets combined with the original x via residual, then through LayerNorm, feed-forward, and again combined via residual to the output

Four operations per block: two LayerNorms, one attention, one feed-forward. All wrapped in residual connections. That’s the building block from which the entire model is constructed by stacking.

If we count the parts — multi-head attention with h heads of d_k dimensions each, FFN with 4·d_model inner width, plus two LayerNorms — we land at a few million parameters per block for a small configuration like GPT-2’s d_model = 768. With d_model = 4096, as in Llama 2 7B, and the usual d_ff = 4·d_model, it’s about 200 million parameters per block.

Parameter count in detail

Per transformer block (excluding embeddings), the parameters are:

  • Attention: four matrices W_Q, W_K, W_V, W_O, each d_model × d_model. That’s 4 · d_model² parameters, plus four bias vectors of d_model values each.
  • Feed-forward: two matrices W_1 (size d_model × d_ff) and W_2 (size d_ff × d_model). With d_ff = 4·d_model that’s 8 · d_model² parameters. Plus two bias vectors with d_ff + d_model values.
  • LayerNorm: two of them with 2·d_model parameters each (gamma, beta).

In total, roughly 12 · d_model² + 13·d_model parameters per block. Dominated by the d_model² terms, two-thirds of which sit in the FFN.

With d_model = 4096 (Llama 2 7B configuration):

  • Attention: 4 · 4096² ≈ 67M
  • Feed-forward: 8 · 4096² ≈ 134M (with SwiGLU it’s even more)
  • LayerNorm: negligible

Total roughly 200M parameters per block. With 32 blocks we land at about 6.4 billion — and then the embedding layer is added (vocabulary size V times d_model, roughly 130M with V = 32000). Together that gives the ~6.7 billion that the Llama 2 7B model has.
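
The same back-of-the-envelope calculation as a few lines of Python, under the simplifying assumptions used above (d_ff = 4·d_model, biases and norms ignored):

def params_per_block(d_model):
    """Rough parameter count of one transformer block (sketch, biases and norms ignored)."""
    attention    = 4 * d_model * d_model    # W_Q, W_K, W_V, W_O
    feed_forward = 8 * d_model * d_model    # W_1 and W_2 with d_ff = 4 * d_model
    return attention + feed_forward

d_model, n_blocks, vocab = 4096, 32, 32000
total = n_blocks * params_per_block(d_model) + vocab * d_model   # blocks + embedding table
print(round(total / 1e9, 2), "billion parameters")               # ~6.6, close to Llama 2 7B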

Depth through stacking

A single block is one mixing. Multiple blocks in a row are a sequence of mixings. Each block takes the output of the previous one as input and works further on it:

def transformer(x, blocks_params):
    for params in blocks_params:
        x = transformer_block(x, *params)
    return x

How many blocks do real models use?

  Model                         Layers   d_model   Heads
  Original Transformer (2017)        6       512       8
  BERT-Base                         12       768      12
  BERT-Large                        24      1024      16
  GPT-2 (1.5B)                      48      1600      25
  GPT-3 (175B)                      96     12288      96
  Llama 3 70B                       80      8192      64
  Llama 3 405B                     126     16384     128

With every layer the model has another opportunity to process information. What happens in the different layers? Research on mechanistic interpretability has produced remarkable insights here in recent years.

Empirically, researchers observe a hierarchy:

  • Lower layers process surface structure: word boundaries, simple syntax, local phrases.
  • Middle layers handle more complex syntax and meaning: subject-verb relationships, coreferences, semantic roles.
  • Upper layers integrate world knowledge, abstract concepts, patterns of argumentation.

Stack of twelve transformer blocks, organized into three zones: lower layers (word boundaries, local phrases, simple syntax), middle layers (subject-verb relationships, coreferences, semantic roles), upper layers (world knowledge, argumentation, abstract concepts)

This hierarchy isn’t hard-coded. It emerges during training because it pays off. The same observation holds for deep convolutional networks in computer vision — lower layers detect edges, upper layers entire objects. Deep transformers do the analogous thing for language. The architecture doesn’t impose a hierarchy, but it makes one possible.

Encoder, decoder, or both

The original 2017 paper introduced an encoder-decoder transformer, intended for machine translation. Two towers of blocks:

  • The encoder processes the source sentence, every token sees every other, no causal mask.
  • The decoder generates the target sentence, sees only what has already been generated, with the causal mask from Article 6.
  • Cross-attention connects the two towers: the decoder queries the encoder output with Q against the encoder’s K/V (also from Article 6).
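
The cross-attention step from the last bullet, as a minimal single-head sketch: the only difference from self-attention (Article 6) is that K and V come from the encoder output while Q comes from the decoder.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_x, encoder_out, W_Q, W_K, W_V):
    """Single-head cross-attention sketch, scaling included, masks and W_O omitted."""
    Q = decoder_x @ W_Q          # queries from the decoder side
    K = encoder_out @ W_K        # keys and values from the encoder side
    V = encoder_out @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V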

But in 2018 two papers showed that the two halves are also useful independently. That turned the whole landscape upside down.

Three architecture variants compared: encoder-only (BERT) for classification, decoder-only (GPT, Claude, Llama) for generation, and encoder-decoder (T5) for translation with cross-attention between the two towers

BERT, the encoder-only specialist

BERT (2018, Google) uses only the encoder side. Bidirectional attention, no causal mask, every token sees every other. It’s trained with masked language modeling: individual tokens are randomly masked, and the model has to predict them from the context on both sides.

BERT doesn’t generate text, but it produces excellent contextualized embeddings. For classification, search, question answering, named entity recognition, it’s still the foundation of many systems today. If you see any NLP pipeline that doesn’t generate but classifies or extracts, there’s a high chance there’s a BERT-style encoder-only model running underneath.

GPT, the decoder-only language model

GPT (2018, OpenAI) went the opposite way: only the decoder, with causal mask. Trained with classical language modeling — the trick from Article 1: predict the next word given all previous ones.

Conceptually simpler than BERT, but it scales fantastically. GPT-2 (2019), GPT-3 (2020), GPT-4 (2023), ChatGPT, Claude, Llama, Mistral — all current generative AI models are decoder-only transformers in the spirit of GPT.

The advantage: pure autoregression, no encoder needed, simple inference. The disadvantage: each token sees only the past, no bidirectional information. But that turned out not to matter much — with enough scale and clever task framing, decoder-only catches up on almost all encoder advantages.

T5 and friends, the original form

Some models still use the full encoder-decoder architecture. T5 (Google) frames every NLP task as text-to-text and uses a classical encoder-decoder. Many translation and summarization models also stick with this variant.

In practice, decoder-only dominates today. The pipeline is simpler, training is more homogeneous, and scaling behavior is better understood. If you see a chat assistant, there’s a good chance a decoder-only transformer sits behind it.

From block to language model

We now have a stack of transformer blocks. To make this into a language model, two pieces are still missing at the ends.

Front end: token embeddings + positional encoding. Inputs are token IDs from the tokenizer (Article 1), so numbers. They get turned into vectors via an embedding table (Article 2), positional encodings are added on top or built in via RoPE, then it goes into the first block.

Back end: output head. After the last block we have a contextualized vector for every token. We project it through a linear layer onto the size of the vocabulary and run a softmax over it. Out comes a probability distribution over the next token.

def language_model(token_ids, params):
    x = embed(token_ids, params.embeddings)
    x = x + positional_encoding(len(token_ids))

    for block_params in params.blocks:
        x = transformer_block(x, *block_params)

    x = layer_norm(x, params.final_ln)
    logits = x @ params.embeddings.T   # tied embeddings
    return logits

That’s the complete pipeline. A few dozen lines of Python for a conceptual language model. Real implementations add efficiency optimizations like KV cache, Flash Attention, and mixed precision, but the architecture stays the same at its core.
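
One small piece the sketch above leaves out: language_model returns raw logits, and the probability distribution mentioned earlier is one softmax away. A minimal sketch for reading off the next token:

import numpy as np

def next_token(logits):
    """Softmax over the last position's logits, then a greedy pick (sketch)."""
    z = logits[-1] - logits[-1].max()        # logits for the next token, numerically stable
    probs = np.exp(z) / np.exp(z).sum()      # probability distribution over the vocabulary
    return int(probs.argmax()), probs        # greedy choice; sampling works here too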

A detail: in many models the embedding weights and the output projection are shared. The table that turns token IDs into vectors is the same one (transposed) that turns vectors back into logits at the end. This is called tied embeddings and it saves parameters without hurting performance.

Scaling, the real secret

The architecture we just described has been remarkably stable since 2017. GPT-1 from 2018 has the same basic structure as Llama 3 from 2024. What has changed is scaling.

  Model                   Year   Parameters   Training tokens
  Original Transformer    2017   65M          ~10B
  BERT-Large              2018   340M         ~3B
  GPT-2                   2019   1.5B         ~10B
  GPT-3                   2020   175B         ~300B
  Llama 2 70B             2023   70B          2T
  Llama 3 70B             2024   70B          15T
  Llama 3 405B            2024   405B         15T
  GPT-4 (estimated)       2023   ~1.5T        ?

More parameters, more data, longer training. In 2020, researchers at OpenAI showed with their scaling laws (Kaplan et al. 2020) that performance improves predictably when you scale parameters, data, and compute up together. That turned the whole industry upside down: “we need smarter architectures” became “we need more compute and more data.”

The architectural refinements since 2017 are more like micro-optimizations — RoPE instead of sinusoidal encodings, GQA (Grouped-Query Attention) for inference efficiency, RMSNorm instead of LayerNorm, SwiGLU instead of ReLU or GELU. All important, but none of it changes the basic concept. The transformer of 2017 is still the transformer of 2025.

What a scaled transformer can really do

With the architecture and enough scale, something emerges that researchers in 2017 wouldn’t have expected. Emergent capabilities appear above a critical model size, without anyone having explicitly programmed them in:

  • Few-shot learning — the model learns a new task from two or three examples in the prompt, without any weights being adjusted.
  • Reasoning — step-by-step argumentation in multi-part problems, often better when the model is explicitly asked to “think out loud” (chain-of-thought).
  • Code generation — turning natural language into runnable code in Python, JavaScript, Rust, or Go.
  • Tool use — using tools like search engines, calculators, APIs in a targeted way.
  • In-context learning — picking up patterns from the prompt and continuing them, even with completely new symbols or languages.

These capabilities are not hard-coded. They emerge because a sufficiently large model was trained on enough text to predict the next token. Nothing more. It is still one of the most astonishing phenomena in computer science of the last few years.

Whether that’s “real understanding” or just a very good statistical trick is an open question that may never be cleanly answered. In practice it works either way. And the architecture that makes it possible is exactly the one we just built.

What’s coming in the next part

We now have a base model, a so-called foundation model. It can complete text because that’s what it was trained for. But when you ask ChatGPT a question, you want a helpful answer, not the most likely completion. When you ask Claude for code, it should write code, not ramble about code. When you ask for instructions on building a bomb, the model should refuse rather than hand them out.

The leap from base model to assistant is the topic of Article 8: fine-tuning. We’ll look at:

  • Supervised fine-tuning (SFT) — training the model on desired behavior using curated question-answer pairs.
  • RLHF (Reinforcement Learning from Human Feedback) — learning from human feedback what a good answer is.
  • DPO and newer methods — making RLHF simpler and more stable.
  • Instruction tuning and chat templates — the pipeline that turned GPT-3 into ChatGPT.
  • Alignment — bringing the model into line with human values, or at least into the same neighborhood.

That closes the series. We’ll then have the complete picture: from the single token in the first article to the production-ready assistant.


Translated with the help of Claude.

All articles in the series

  1. The next word, how language models work
  2. Words as points in space, what embeddings really are
  3. Neural networks from scratch
  4. Backpropagation, how a model learns
  5. Context and RNNs, why order matters
  6. Attention Is All You Need
  7. The transformer, the complete architecture ← this article
  8. Fine-tuning, from base model to assistant (coming soon)

Series: How LLMs Work · rotecodefraktion.de