Attention Is All You Need

Article 6 of 8 · Series: How LLMs Work

In Article 5 we hit a wall. RNNs learn context by squeezing it through a notebook, token by token. That works for short sentences. For long texts the beginning gets lost in noise, and training through a 2,000-token sequence on a GPU runs slower than researchers’ patience could bear.

The question was: does it have to be this way? Does information have to travel sequentially, left to right, through a single bottleneck? Or is there another way?

In 2017 the answer arrived. Eight researchers at Google published a paper with arguably the most famous title in NLP history: “Attention Is All You Need” (Vaswani et al., 2017). Their claim: the entire recurrent apparatus isn’t needed. A single mechanism does the job, and it’s surprisingly simple. It’s called Attention. This article shows what it is, how it is computed, and why it rebuilt the language-model world from the ground up.

The Core Idea, a Library Instead of a Telephone Game

Imagine we want to understand a sentence. More precisely: we want to sharpen the meaning of one specific word in the sentence by looking at the other words.

In an RNN the strategy was the notebook: everything that came before was written into it step by step and read back when the current word arrived. It was designed as memory, but in practice the repeated application of W_hh turned it into a telephone game, especially across long sequences.

The Attention idea is different. Imagine a library. Each word in the sentence puts down two things:

  • An index card with a few keywords describing what it is (that’s the Key)
  • A book with the actual content (that’s the Value)

Now when the word bank wants to understand its surroundings, it formulates a query (that’s the Query) and compares it against every index card in the library. Cards that match the query well get a high weight. Cards that match poorly get a low weight. The books are then returned as a weighted blend, dominated by the books whose cards matched best.

That’s exactly Attention. Three roles per token, three vectors per token, one search, one weighted mix of contents.

What’s remarkable: every token in the sequence can run this search for itself, in parallel. No one has to wait for the previous word to finish. The notebook bottleneck disappears.

Three Roles, Three Projections

What are these three vectors concretely? We start with what we know from Article 2, the embedding for each token. A vector with, say, 768 dimensions, encoding the meaning of the word.

From this single embedding the model produces three different views:

  • The embedding is multiplied with W_Q, giving the Query vector q
  • The embedding is multiplied with W_K, giving the Key vector k
  • The embedding is multiplied with W_V, giving the Value vector v

W_Q, W_K and W_V are three weight matrices the network learns. There’s nothing mystical about them, just three linear layers like in Article 3, without an activation function.

Three projections from one embedding. And that’s the point. The word has a meaning, but it can play three different roles depending on who’s looking. As Query it asks: what am I looking for? As Key it announces: here’s what I offer, find me if you need this. As Value it provides: here’s what I look like when you take me.

Three views of one token

[Figure: the embedding x of one token is projected three ways: q = x · W_Q, the Query (“What am I looking for?”); k = x · W_K, the Key (“How can I be found?”); v = x · W_V, the Value (“What do I contribute?”).]

Key: One embedding is split into three specialized vectors via three linear projections W_Q, W_K, W_V. The three matrices are learned jointly during training.

On first reading this feels artificial. Why three roles? Why not just use the embedding directly for everything? The answer is pragmatic. The three projections give the network degrees of freedom. The word bank can signal something different as a Query (“I need context about what’s left and right of me”) than as a Key (“I’m a noun with two possible meanings”) than as a Value (“my likely contents are money or seating”). Three views, three specialized functions, all derived from the same embedding, jointly optimized in training.
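To make the three projections concrete, here is a minimal sketch with made-up sizes (a 768-dimensional embedding and 64-dimensional projections are common choices, but the exact numbers are model-specific; the random matrices stand in for learned weights):

import numpy as np

d_model, d_k = 768, 64                      # hypothetical sizes
rng = np.random.default_rng(0)

x   = rng.standard_normal(d_model)          # the embedding of one token
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_k)) * 0.02

q, k, v = x @ W_Q, x @ W_K, x @ W_V         # three views of the same token
print(q.shape, k.shape, v.shape)            # (64,) (64,) (64,)

With untrained random matrices the three vectors carry no meaning yet; only training gives them their roles as Query, Key and Value.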

The Computation in Four Steps

Enough metaphor, here’s the concrete process. Self-attention for an entire sequence, computed once. The whole operation fits into a few lines of Python.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """A single self-attention step."""
    Q = X @ W_Q                          # (n, d_k)
    K = X @ W_K                          # (n, d_k)
    V = X @ W_V                          # (n, d_v)

    scores  = Q @ K.T / np.sqrt(Q.shape[-1])   # (n, n)
    weights = softmax(scores, axis=-1)         # (n, n)
    output  = weights @ V                      # (n, d_v)
    return output, weights

Six lines, three of them simple matrix multiplications. What’s happening here? Step by step:

Step 1, three projections. From the input X (a matrix with one row per token) we compute the three views Q, K and V. Each is again a matrix with one row per token.

Step 2, measure similarities. Q @ K.T is a matrix with n × n entries. Entry (i, j) is the dot product between the Query of token i and the Key of token j. Dot product means: high value when the vectors point in similar directions, low value when they have little to do with each other. That’s the library lookup, every token compares its Query against every Key.

Step 3, normalize with softmax. The raw similarities are converted row by row into a probability distribution by softmax. Per row the weights sum to 1. The result is the attention matrix: for each token (row) the proportions with which it considers all other tokens (columns).

Step 4, weighted mix of values. weights @ V mixes the value vectors accordingly. Token i gets a weighted sum of the values of all other tokens, weighted by the attention proportions from step 3.

That’s it. Self-attention is nothing more than that.

A word on the scaling by √d_k in step 2: in high dimensions dot products grow on average, and the softmax distribution becomes too peaked, almost only zeros and ones, which kills the gradient during training. Dividing by √d_k keeps the values in a trainable range.

The Attention formula, compact

In the original paper the entire procedure is summarized in a single line:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

d_k is the dimension of the Query and Key vectors. Q has dimension (n, d_k), K has (n, d_k), V has (n, d_v). The output has dimension (n, d_v), so the same number of tokens as the input, just with new contextualized vectors.
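A quick shape check of the self_attention function above, with random inputs and hypothetical sizes (n = 7 tokens, 768-dimensional embeddings, 64-dimensional Query/Key/Value vectors):

n, d_model, d_k, d_v = 7, 768, 64, 64       # hypothetical sizes
rng = np.random.default_rng(0)
X   = rng.standard_normal((n, d_model))     # one row per token
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_v)) * 0.02

out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape)                            # (7, 64), one new vector per token
print(weights.shape)                        # (7, 7), the attention matrix
print(weights.sum(axis=-1))                 # each row sums to 1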

The scaling by √d_k comes from the following observation: if the components of Q and K are random with mean 0 and variance 1, the dot product q · k has variance d_k. For large d_k the values before softmax become very large, which pushes the gradients of softmax to almost zero. Dividing by √d_k brings the standard deviation back to 1, keeping the softmax trainable.
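A quick numerical check of that claim, with d_k = 64 as an example size:

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((10_000, d_k))      # many random query vectors
k = rng.standard_normal((10_000, d_k))      # many random key vectors
dots = (q * k).sum(axis=-1)                 # 10,000 dot products
print(dots.std())                           # ≈ 8, i.e. ≈ sqrt(d_k)
print((dots / np.sqrt(d_k)).std())          # ≈ 1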

Backpropagation works without any special tricks. The operation is a chain of matrix multiplications and softmax, all differentiable. Gradients flow through Q, K and V, eventually landing on W_Q, W_K, W_V.

An Example, “The bank by the river was old”

Let’s see what the attention matrix says about a concrete sentence. Take a sentence that picks up the word bank from Article 5:

The bank by the river was old.

Seven tokens. If we run self-attention on this sentence (with a trained model), we get a 7 × 7 matrix of attention proportions. Each row belongs to a token; entry (i, j) shows how much token j contributes to the contextualized meaning of token i.

What we’d see in a properly trained model: the row for bank has high values at river and old, low ones at The and was. The model has learned that river is the decisive piece of information for the correct meaning of bank. It pulls semantic context directly from the sentence, not through a sequential notebook, but through a single matrix multiplication.

That’s exactly the punchline. In an RNN, the information river would have had to flow through by first, then to the next step, through two applications of W_hh, each with loss. In self-attention it arrives in one step.
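Reading a row off the attention matrix is a one-liner. A sketch using the weights returned by self_attention above (with random, untrained projections the distribution is near-uniform; only a trained model produces the river-heavy row described here):

tokens = ["The", "bank", "by", "the", "river", "was", "old"]
row = weights[tokens.index("bank")]         # weights: the (7, 7) attention matrix
for tok, w in sorted(zip(tokens, row), key=lambda pair: -pair[1]):
    print(f"{tok:>6}  {w:.2f}")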

Attention heatmap

[Interactive figure: the 7 × 7 attention matrix for “The bank by the river was old.”, one row per token; selecting a row shows that token’s attention distribution.]

The visualization shows this. Bright cells mean high attention, dark cells mean low. In practice attention matrices look like scattered clusters of bright spots: certain tokens look at certain other ones, and the patterns hint at what the model has “understood.”

Self-Attention vs Cross-Attention

What we’ve described so far is self-attention. Q, K and V all come from the same sequence. Every token looks at every other token in the same sentence.

There’s also cross-attention. Here Q and K/V come from different sequences. The classic application is machine translation. The encoder processes the German source sentence and produces embeddings. The decoder generates the English target sentence, token by token. When generating each English token, the decoder queries with its own Query into the encoder’s output, K and V come from the encoder, Q from the decoder. That’s how the decoder selectively pulls what it needs from the source sentence.

Self and cross are exactly the same computation. The only difference is where the three vectors come from. Q and K/V from the same tensor, or from two different ones.
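As a sketch, here is the same computation with two input tensors instead of one, reusing the softmax from above (the function and parameter names are mine, not from the paper):

def cross_attention(X_q, X_kv, W_Q, W_K, W_V):
    """Same math as self-attention, but Q comes from one sequence, K and V from another."""
    Q = X_q  @ W_Q                               # queries, e.g. from the decoder
    K = X_kv @ W_K                               # keys, e.g. from the encoder
    V = X_kv @ W_V                               # values, e.g. from the encoder
    scores  = Q @ K.T / np.sqrt(Q.shape[-1])     # (n_q, n_kv)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights                  # output: (n_q, d_v)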

In modern language models like GPT, Claude or Llama, self-attention plays the leading role. Cross-attention shows up mainly in encoder-decoder models like T5 or the original transformers for translation.

Multi-Head, Several Searches in Parallel

A single attention operation is powerful, but it does only one thing at a time. One search, one set of Q/K/V projections.

Language is complex, though. For the word bank we might want to know simultaneously: what’s the semantics (money or seating), what’s the syntax (subject or object), is there a long-range reference (which pronoun later refers back to it). A single attention operation has to pack all that into one search. It can, but it’s tight.

Multi-head attention solves this by running several attention operations in parallel. Instead of one set W_Q, W_K, W_V the model has h sets, each with smaller dimensions. Each head does its own search, with its own Q/K/V projections. Afterwards the outputs of all heads are concatenated and pushed through a final projection W_O back to the original dimension.

def multi_head_attention(X, heads, W_O):
    """Multi-head attention, simplified: h independent searches, merged at the end."""
    head_outputs = []
    for W_Q, W_K, W_V in heads:                      # one set of projections per head
        out, _ = self_attention(X, W_Q, W_K, W_V)
        head_outputs.append(out)
    concat = np.concatenate(head_outputs, axis=-1)   # (n, h * d_v)
    return concat @ W_O                              # final projection back to d_model

Conceptually that’s all there is to it. h parallel, smaller self-attentions, whose outputs are merged back into one vector at the end.
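The dimension bookkeeping, using the multi_head_attention sketch above with hypothetical sizes: d_model = 768 split across h = 12 heads gives 64 dimensions per head, and W_O maps the concatenated 768 dimensions back to the model dimension:

d_model, h = 768, 12                        # hypothetical sizes
d_head = d_model // h                       # 64 dimensions per head
rng = np.random.default_rng(0)

heads = [
    tuple(rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
    for _ in range(h)
]                                           # h sets of (W_Q, W_K, W_V)
W_O = rng.standard_normal((h * d_head, d_model)) * 0.02

X = rng.standard_normal((7, d_model))       # 7 tokens
out = multi_head_attention(X, heads, W_O)
print(out.shape)                            # (7, 768), back to the model dimension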

Why do the heads specialize at all?

At first glance this seems strange. All heads receive the same input. Why would they learn different things? Wouldn’t they all do the same thing?

The answer lies in two mechanisms working together:

1. Different starting points. Each head gets its own weight matrices, all three of them per head, so W_Q, W_K, W_V. During initialization these matrices are filled with small random numbers, each different. So the heads already start at different places in the solution space.

2. Shared responsibility during learning. The outputs of all heads are concatenated and feed jointly into the next stage. If two heads tried to learn the same pattern during training, one of them would be redundant and the loss couldn’t decrease any further. The backpropagation signal automatically pushes the heads into different niches, because that reduces the loss most. No one tells them: “you do subject-verb relations.” They find their tasks themselves, because distribution works better than duplication.

What do these specializations look like?

Visualizations from the original paper and tools like BertViz make this visible. Here are four typical patterns observed in trained transformers, on the same example sentence:

Four heads, four specializations

“Anna saw the dog and she petted it.”

[Figure: four attention heatmaps over the same sentence, one per head.]

Head 1: subject to verb
Head 2: coreference (pronouns)
Head 3: nearby neighbors
Head 4: long-range references

Key: Four heads on the same sentence, each with its own Q/K/V projections. Because the weight matrices are initialized differently and all heads jointly contribute to reducing the loss, the specializations diverge naturally during training. No one has to tell each head what pattern to learn.

What we see: Head 1 has bright cells mainly where subjects and verbs are, Anna → saw, she → petted. Head 2 resolves pronouns, she → Anna, it → dog. Head 3 shows a diagonal structure, every token attends mainly to its direct neighbors (useful for recognizing local phrases). Head 4 has scattered, distant references, the verb petted for example connects with it at the end of the sentence.

These patterns aren’t hard-coded. They simply emerge during training because they pay off. That’s what makes multi-head attention so powerful: the network gets a kind of automatic linguistics for free, without anyone having to tell it what subject-verb relations or coreferences are.

In modern LLMs 32, 64 or more heads are typical. Llama 3 405B has 128 heads per layer. Each layer has hundreds of parallel searches, stacked through dozens of layers. That’s where the enormous representational power of modern models comes from.

Causal Masking, Don’t Look at the Future

An important constraint for language models: when predicting the next word, we obviously can’t look into the future. When predicting token 5, the model may only see tokens 1 through 4, not 6, 7, 8.

This is solved with a simple mask before the softmax. All entries of the score matrix that point into the future, that is all entries above the main diagonal, are set to negative infinity. After softmax these become exactly zeros, so the token gets no attention on tokens that come after it.

# scores: the (n, n) matrix Q @ K.T / sqrt(d_k) from above
n = scores.shape[0]                         # sequence length
# Lower triangular matrix, ones on and below the diagonal, zeros above
mask = np.tril(np.ones((n, n)))
scores = np.where(mask == 1, scores, -np.inf)
weights = softmax(scores, axis=-1)
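A quick sanity check with random scores (hypothetical size n = 5): after masking and softmax, everything above the diagonal is exactly zero, and each row still distributes its full attention over the past and the current position:

n = 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))        # stand-in for Q @ K.T / sqrt(d_k)

mask = np.tril(np.ones((n, n)))
weights = softmax(np.where(mask == 1, scores, -np.inf), axis=-1)

print(np.allclose(np.triu(weights, k=1), 0.0))   # True, no attention on the future
print(weights.sum(axis=-1))                      # rows still sum to 1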

When does the mask actually matter?

The mask isn’t just an inference safety net, it’s primarily a training necessity.

During training: strictly required. Here the model processes the entire sentence in parallel in a single forward pass. For every position, the next-token prediction is computed simultaneously, and the loss is aggregated over all positions. Without the mask, position 1 (Anna) would already be able to see position 2 (saw), and the model would simply copy the answer instead of predicting it. This very parallelism is the reason transformers are so efficiently trainable in the first place; before 2017 with RNNs, you had to step through token by token. The mask is the trick that allows parallelism and causality at the same time.

During inference: technically redundant, but still there in practice. When generating, the model writes token by token, the future doesn’t even exist yet. Logically, the mask wouldn’t be needed. It stays in for two reasons: first, consistency with training behavior (otherwise the learned weights don’t match numerically), and second, parallel processing also happens at inference time. During prompt prefill ("Write me a poem about...") the entire prompt is loaded into the KV cache at once, and when batching with padded sequences, the padding has to be masked out anyway.

Within the attention mechanism, that’s the only difference between an encoder model like BERT (sees the whole sentence, no mask) and a decoder model like GPT (only sees the past, with mask). And that’s also why GPT-style models are well-suited for autoregressive text generation, every token is computed strictly from what came before it.

What Attention Actually Solves

We have all the building blocks now. What did Attention concretely change compared to RNNs?

Long-range dependencies are directly accessible. In an RNN, information between two distant tokens had to flow through every intermediate step, each with loss. In attention a single matrix operation suffices, no matter how far apart two tokens are. Distance no longer matters.

Parallelization comes naturally. Self-attention for the whole sequence is one big matrix multiplication, perfect for GPUs. While an RNN needs 1,000 sequential steps for a 1,000-token sequence, self-attention does it in one go. Training and inference accelerate by orders of magnitude, which is what made models with billions of parameters trainable in the first place.

Interpretability as a bonus. The attention matrix is readable. Unlike the hidden state of an RNN, which is a hard-to-decipher mixed soup, with attention you can see directly which token looked where. It’s not perfect, attention is not the same as understanding, but it’s a window RNNs didn’t have.

The price: quadratic in sequence length. Q · Kᵀ is an n × n matrix. At 1,000 tokens that’s a million entries, at 100,000 tokens (long context) it’s ten billion. Memory and compute grow quadratically. That’s the main reason why “long context” is a research field of its own in modern models (keywords: Flash Attention, Sliding Window Attention, State Space Models like Mamba).
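The arithmetic behind those numbers, as a quick check (a single float32 attention matrix, ignoring heads and layers):

for n in (1_000, 100_000):
    entries = n * n                          # size of the attention matrix
    megabytes = entries * 4 / 1e6            # 4 bytes per float32 entry
    print(f"{n:>7} tokens: {entries:>14,} entries, {megabytes:>12,.0f} MB")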

What’s Still Missing

We now have a mechanism that lets every token see every other one in the context. That alone isn’t a language model yet. Three things are still missing:

Position. Self-attention by itself has no notion of order. If we shuffle the tokens, each token ends up with exactly the same output vector as before, just in shuffled order. But language lives on order. “Dog bites man” is not “man bites dog.” For the model to understand order we need positional encodings, an extra signal that tells each embedding: you’re token 1, you’re token 2, you’re token 17.

Depth. A single attention layer makes one mix. But complex language needs several mixes stacked. Real language models stack dozens of transformer blocks, each with an attention and a feed-forward layer.

Stability. Deep stacks struggle with the vanishing gradients we met in Article 5. The remedy is residual connections (an idea borrowed from the ResNet world) and layer normalization.

These three ingredients combined with attention make a transformer. That’s the topic of Article 7. We’ll build a complete transformer block, stack several into a real language model, and look at how this construction became GPT, BERT, Llama and Claude.


Translated with the help of Claude.

All articles in the series

  1. The next word, how language models work
  2. Words as points in space, what embeddings really are
  3. Neural networks from scratch
  4. Backpropagation, how a model learns
  5. Context and RNNs, why order matters
  6. Attention, the mechanism that changed everything ← this article
  7. The transformer, the full architecture (coming soon)
  8. Fine-tuning, from base model to assistant (coming soon)

Series: How LLMs Work · rotecodefraktion.de