Context and RNNs — Why Order Matters

Article 5 of 8 · Series: How LLMs Work

When we read a sentence, we build up its meaning word by word. After the third word we know more than after the first, after the tenth yet more. Each new word gets its meaning only from the ones that came before. That is the fundamental nature of language: order and context are the meaning.

A small experiment: the word “bank”. On its own it is a riddle. Could be the edge of a river, could be a financial institution. We place a sentence around it:

He needed cash and went to the bank.

He sat down on the bank of the river.

The same word, the same embedding vector from Article 2. But two completely different meanings. And the difference comes purely from what came before.

The model we built in Article 4 cannot handle this. It only ever sees a single token at a time. Input is exactly one embedding, output is a prediction. The network has no idea what came two sentences earlier, or even one word earlier.

So we need a network with memory. The first idea that actually worked in the deep-learning world was the Recurrent Neural Network, RNN for short. This article shows how RNNs work, where they shone, where they failed, and why today they have largely been replaced by the next idea, the Transformer.

The Core Idea: a Notebook for the Network

If we want to process language token by token, the network needs some place to note down what it has seen so far. A kind of short-term memory.

In an RNN, this memory is called the hidden state. We can think of it as a notebook that the network carries with it. At the first token, the notebook is empty. After each new token, the network writes onto it, and it also reads from it again before making its next prediction.

The computation that happens is conceptually very simple. At each time step:

  1. Take the current token as input.
  2. Take the old notebook.
  3. Combine the two into a new notebook.
  4. Derive the prediction from the new notebook.

Done. That is the entire principle. The complexity is hidden in step 3, where input and old state become a new state. But the mechanics are at their core a single step from Article 3: weighted sum plus activation function. Just with two inputs coming in instead of one.

What This Looks Like in Code

Before we dive into formulas, here is the code for a single RNN step. Three weight matrices, a bit of numpy, nothing more.

import numpy as np

# Three weight matrices the network will learn:
# W_xh: processes the current input
# W_hh: processes the notebook
# W_hy: derives the prediction from the notebook
#
# Plus two biases, as we know from Article 3.

def rnn_step(x, h):
    """A single RNN step."""
    h_new = np.tanh(x @ W_xh + h @ W_hh + b_h)   # new notebook
    y     = h_new @ W_hy + b_y                    # prediction
    return h_new, y

Two lines of math and a return, that is the entire RNN cell. The first line is the core idea:

  • x @ W_xh: the current token is passed through a first weight matrix
  • h @ W_hh: the old notebook is passed through a second matrix
  • Both results are added, plus bias, plus activation

The addition is the trick. It fuses the current input with whatever came before. W_hh, the matrix that transforms the notebook, is the actual memory organ of the network. It decides how information travels from one step to the next, what is preserved and what is lost.

Now let’s run a short sequence through it:

np.random.seed(42)

d_in, d_hidden, d_out = 4, 8, 5
W_xh = np.random.randn(d_in,     d_hidden) * 0.3
W_hh = np.random.randn(d_hidden, d_hidden) * 0.3
W_hy = np.random.randn(d_hidden, d_out)    * 0.3
b_h, b_y = np.zeros(d_hidden), np.zeros(d_out)

# A sequence of 4 tokens (random embeddings)
sequence = np.random.randn(4, d_in)

# Notebook starts empty
h = np.zeros(d_hidden)

# Go through token by token
for t, x in enumerate(sequence):
    h, y = rnn_step(x, h)
    print(f"Step {t}: notebook[:3] = {h[:3].round(3)}")
Step 0: notebook[:3] = [ 0.082 -0.217  0.435]
Step 1: notebook[:3] = [-0.168  0.091  0.312]
Step 2: notebook[:3] = [ 0.273 -0.425  0.602]
Step 3: notebook[:3] = [-0.038  0.315  0.148]

The notebook changes with every step. And, this is the crucial bit, its content at step 3 depends not only on the fourth token, but also on the third, second, and first. Information flows through time.

The Same Cell, Again and Again

There’s something the code doesn’t make obvious at first glance: across all four steps we are using the same weight matrices. No fresh W_xh for step 2, no new W_hh for step 3. Always the same set.

This is a fundamental difference from a deep MLP like the one in Article 3, where each layer has its own weights. An RNN recycles its weights across time. Conceptually it is one cell, applied over and over again, token by token, until the sequence ends.

This is exactly why an RNN can handle sequences of arbitrary length. Whether the input is three tokens long or three hundred, the network needs no new parameters. It just runs in a loop a little longer.
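
To make that concrete, here is a small helper that runs a whole sequence through the cell. It reuses rnn_step and the weights defined above; the name run_rnn is just for illustration.

def run_rnn(sequence):
    """Run the same RNN cell over an entire sequence, reusing the weights at every step."""
    h = np.zeros(d_hidden)              # the notebook starts empty
    outputs = []
    for x in sequence:                  # one iteration per token
        h, y = rnn_step(x, h)           # same W_xh, W_hh, W_hy every single time
        outputs.append(y)
    return np.array(outputs), h         # all predictions plus the final notebook

# Works for 3 tokens or 300 without a single new parameter:
print(run_rnn(np.random.randn(3,   d_in))[0].shape)   # (3, 5)
print(run_rnn(np.random.randn(300, d_in))[0].shape)   # (300, 5)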

In diagrams, RNNs are usually drawn in two ways. On the left as a single cell with a feedback arrow to itself. On the right “unrolled”, as a sequence of copies of the same cell, one per time step, connected by the notebook arrow. Both drawings mean the same network; the unrolled variant is just more useful for picturing training.

[Diagram: the RNN unfolded over 4 time steps. The inputs x₀ … x₃ feed through W_xh into one copy of the RNN cell per step, the hidden states h₋₁ … h₃ (the notebook) are passed along from step to step via W_hh, and each step produces a prediction y₀ … y₃ via W_hy. The three weight matrices W_xh, W_hh, W_hy are identical across all time steps: one network, many steps.]

Training: Blame Travels Back in Time

Suppose we want the network to predict the next word at each step of a sentence. After the tenth word, it compares its prediction with the actual eleventh word, computes the error, and the game plays out as in Article 4: measure the error, compute gradients, update weights.

The difference is: the “network” through which the error is propagated back is now the unrolled version. Ten copies of the cell in a row. The algorithm is called Backpropagation Through Time, BPTT. Sounds like science fiction, but at its core it’s the same algorithm as in Article 4, just through a sequence instead of through stacked layers.

One peculiarity: because all time steps use the same weights, the gradient contributions from all time steps accumulate during the backward pass onto the same parameters. That isn’t a bug, it’s a feature. The network learns from every position in the sequence simultaneously, and that is exactly what turns a single RNN cell into something that can generalize over time.

The gradients through time, more formally

The hidden state h_t enters the total loss at two places: directly, through the output y_t at step t, and indirectly, because it feeds into the computation of h_{t+1} and thereby affects every later step too.

When propagating backward, these contributions sum up:

dL/dh_t  =  dL_t/dh_t  +  (dL/dh_{t+1}) · dh_{t+1}/dh_t

This is a recursive formula that we unroll from the end of the sequence back to the start.

For the weight matrix W_hh the analogous rule holds: the gradient is the sum of the contributions at each time step. The network learns from every position at once, and the updates consolidate into a single, shared matrix.
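
To see what that accumulation looks like mechanically, here is a minimal sketch of the backward pass for W_hh. It is illustrative only: it assumes we kept the hidden states from the forward pass (hs[0] is the initial state, hs[t+1] the state produced at step t) and already have the direct per-step gradients dL_t/dh_t; the names bptt_W_hh and dL_dh_direct are made up for this example.

def bptt_W_hh(hs, dL_dh_direct):
    """Sketch: accumulate the gradient for the shared matrix W_hh across all time steps."""
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(d_hidden)             # gradient arriving from the future, zero at the end
    T = len(dL_dh_direct)
    for t in reversed(range(T)):             # walk backward through time
        dh = dL_dh_direct[t] + dh_next       # direct + indirect contribution (the recursive formula)
        da = dh * (1 - hs[t + 1] ** 2)       # back through tanh: hs[t+1] = tanh(a_t)
        dW_hh += np.outer(hs[t], da)         # every step adds its share to the SAME matrix
        dh_next = da @ W_hh.T                # pass the blame one step further into the past
    return dW_hh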

Now comes the place where RNNs failed in practice for years.

The Telephone-Game Problem

At every step, the notebook is transformed by the same matrix W_hh. In a sequence of 30 tokens, that transformation therefore happens 30 times in a row. During training, when the gradient flows backward through time, it is likewise multiplied at every step by a factor derived from W_hh (its transpose, scaled by the tanh derivative).

That is the problem. Picture the telephone game: a message is whispered 30 times in a row. By the end, usually nothing is left of the original. Exactly the same thing happens with gradients through time.

If that repeated factor is on average smaller than 1 in magnitude, the gradient shrinks with every step. After 20 steps it is practically zero. The network gets no training signal for inputs that lie far in the past. This is the famous vanishing gradient.

If it is on average larger than 1, the gradient explodes instead. Updates become astronomically large, training turns numerically unstable, NaN values appear. Exploding gradient.

The practical outcome: a simple RNN learns short-range dependencies well. “What word follows a verb directly?”, no problem. Long-range dependencies, however, fail. “Which subject was introduced fifteen words ago?”, for that the memory simply isn’t enough. For language, this is a serious limitation. Sentences, paragraphs, documents constantly contain references across long distances.

[Interactive demo in the original article: the whisper game through time. With a multiplier of 0.90 per step over 20 steps, the final gradient is about 0.121 — still stable, but already small; push the multiplier above 1 and it explodes instead.]
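
The same effect fits into a few lines of plain Python, a toy calculation with a single scalar factor per step standing in for the full matrix:

steps = 20
for factor in (0.9, 1.0, 1.1):
    print(f"{factor}: {factor ** steps:.3f}")
# 0.9: 0.122   -> the signal vanishes
# 1.0: 1.000   -> stable
# 1.1: 6.727   -> the signal explodes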

Exploding gradients can be handled with a pragmatic trick: gradient clipping, which simply caps the size of the gradient before each update. Vanishing gradients were the harder problem. They required an architectural rework.
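
In code, that trick is only a few lines. A minimal sketch; clip_gradient is an illustrative helper, and the threshold max_norm is a value you would tune in practice:

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its overall size exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad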

LSTMs: a Notebook with Tabs

The solution came in 1997 from Sepp Hochreiter and Jürgen Schmidhuber. The paper was barely noticed at the time and is now considered one of the most important contributions to machine learning ever. The architecture is called Long Short-Term Memory, LSTM.

The idea sounds quite natural with our notebook metaphor. The problem with a plain RNN is that the whole notebook is rewritten at every step. If important older information accidentally gets painted over, it’s gone. And because this happens at every step, nothing survives for long.

An LSTM extends the notebook with a second layer: a cell state, a kind of long-term memory alongside the notebook. And three small decision makers called gates:

  • The forget gate decides for every slot in long-term memory: “delete or keep?”
  • The input gate decides: “is there new content to write in?”
  • The output gate decides how the current long-term memory becomes the new notebook

Each of these gates is itself a small neural network, and they learn their decisions from the data. The trick: the long-term memory flows through time without being completely remixed at every step. Information can survive over hundreds of steps, as long as the forget gate does not actively erase it. The telephone-game problem hasn’t completely disappeared, but it is significantly dampened.

In code an LSTM step looks like this (conceptually, slightly simplified):

# Helper used by all three gates: the logistic (sigmoid) function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c):
    """Simplified LSTM step. The weight matrices W_f, W_i, W_o, W_g and their
    biases are assumed to exist, analogous to the RNN weights above."""
    z = np.concatenate([x, h])

    f = sigmoid(z @ W_f + b_f)        # forget gate: what to erase from long-term memory
    i = sigmoid(z @ W_i + b_i)        # input gate: what to write into it
    o = sigmoid(z @ W_o + b_o)        # output gate: what to expose as the new notebook
    g = np.tanh( z @ W_g + b_g)       # candidate content for writing

    c_new = f * c + i * g             # update long-term memory
    h_new = o * np.tanh(c_new)        # derive the new notebook from it
    return h_new, c_new

That looks like a lot, and it is. An LSTM has about four times as many parameters per cell as a plain RNN. The reward: sequences with a hundred or more tokens become trainable.

A related simplification is the GRU (Gated Recurrent Unit), introduced in 2014 by Kyunghyun Cho et al. It has only two gates and no separate long-term memory, yet achieves comparable results to LSTM on many tasks with fewer parameters. In practice the choice between the two was often a matter of taste.
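
Sketched in the same conceptual style as the LSTM step above (one common formulation; the weight matrices W_z, W_r, W_n and their biases are placeholders, not defined anywhere in this article), a GRU step looks roughly like this:

def gru_step(x, h):
    """Simplified GRU step: two gates, no separate long-term memory."""
    z_in = np.concatenate([x, h])

    u = sigmoid(z_in @ W_z + b_z)                         # update gate: how much to rewrite
    r = sigmoid(z_in @ W_r + b_r)                         # reset gate: how much of the old notebook to use
    n = np.tanh(np.concatenate([x, r * h]) @ W_n + b_n)   # candidate notebook

    h_new = (1 - u) * h + u * n                           # blend old and candidate notebook
    return h_new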

The Golden Era: 2014 to 2017

With LSTMs (and a little later GRUs), RNNs conquered practically every NLP task between 2014 and 2017. The dominant pattern was called Sequence-to-Sequence (Ilya Sutskever et al., 2014): two RNNs, where an encoder reads the input sequence and produces a final state, and a decoder starts from that state and generates the output sequence. This pattern powered translators (Google Neural Machine Translation, 2016), image captioners, speech synthesis, and early language models.

Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” from 2015 was for many the moment when RNNs went from research topic to something you could actually play with. He showed character-level RNNs that generated Shakespeare-like text, fake LaTeX, or Linux-kernel-style C code. Silly? Sure. But it made one thing visible: these networks actually learn structure, not just word frequencies.

At the same time, the limits became ever clearer. Two problems could not be worked around architecturally:

1. Long-range dependencies remained difficult. LSTMs were a huge improvement over plain RNNs. But they too struggled with references across hundreds of tokens. A sentence of 20 words was no problem. A 2000-word document was.

2. RNNs map badly onto GPUs. The sequential structure of an RNN forces you to finish processing token 1 before token 2 can even start, before token 3 can start, and so on. Modern GPUs are built to carry out thousands of operations simultaneously. An architecture that simply doesn’t allow that barely uses such hardware. RNN training on long sequences took days to weeks.

Both problems had the same root: the network has to push information through a narrow, sequential channel. The hidden state (or cell state, for LSTMs) is a single fixed-size vector through which all context must be squeezed, token by token, in single file.

The Escape: Just Look Everywhere

In 2015, Bahdanau, Cho, and Bengio introduced a new mechanism, initially as an extension for RNN translators: attention. The idea: instead of looking only at its own notebook, the decoder would, at every step of the output, actively look back at all positions of the input, weight them, and pick out the ones that are relevant right now. Rather than squeezing information through a narrow channel, the decoder could simply pick ingredients from the whole sentence.
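
As a tiny foretaste of Article 6: the core of that lookup is nothing more than a weighted average. A toy sketch, where decoder_state stands in for the decoder’s current notebook, encoder_states for the stored per-token states of the encoder, and the relevance score is a plain dot product:

def attend(decoder_state, encoder_states):
    """Weight every input position by its current relevance, then mix them."""
    scores  = encoder_states @ decoder_state            # one relevance score per input position
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax: scores become weights that sum to 1
    return weights @ encoder_states                     # weighted mixture of all input positions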

Two years later, in 2017, Vaswani and colleagues published the paper with the boldest title in NLP history: “Attention Is All You Need”. Their central claim: attention alone, without the recurrent substrate, not only works, it works better. The architecture they presented is called the Transformer and dispenses entirely with recurrence. A Transformer processes all tokens of a sequence in parallel; relationships between positions are modelled by attention. With that, both problems of the RNN era resolve themselves: long-range dependencies become structurally accessible, and parallelization falls out naturally.

What attention actually is and how it works is the subject of Article 6. How a complete Transformer is assembled from it follows in Article 7.

What Remains of RNNs

Even though RNNs have been replaced by Transformers in the NLP mainstream today, they left the field a few important lessons:

  • Sharing weights across time is a powerful principle that is still used in modern form today.
  • Vanishing gradients are not a special problem of RNNs. They appear in every deep network. The later fix via skip connections (known from ResNets) is conceptually closely related to the gates of LSTMs.
  • Recurrent architectures are not dead. Modern State Space Models like Mamba (2023) experiment again with recurrent structures, in a more economical form than classic RNNs but with the same basic idea. For certain problem classes, especially very long sequences, they are more efficient than Transformers.

What really took RNNs out of the race in the end wasn’t a single conceptual error, but the fundamental clash between sequential structure and parallel hardware. The Transformer resolved it by simply giving up sequential dependence. How that works, we’ll see in the next article.

All Articles in the Series

  1. The Next Word — How Language Models Work
  2. Words as Points in Space — What Embeddings Really Are
  3. Neural Networks from Scratch
  4. Backpropagation — How a Model Learns
  5. Context and RNNs — Why Order Matters ← this article
  6. Attention — The Mechanism That Changed Everything (coming soon)
  7. The Transformer — The Complete Architecture (coming soon)
  8. Fine-Tuning — From Base Model to Assistant (coming soon)

Series: How LLMs Work · rotecodefraktion.de


Translated from the German original with the help of Claude.