The Next Word — How Language Models Work
Article 1 of 8 · Series: How LLMs Work
What actually happens in the millisecond between an input and the first character of the response?
Most people have a rough idea — something to do with AI, something with neural networks, something involving a lot of computing power. That rough idea is enough to use a language model. It is not enough to understand why it behaves the way it does — why it sometimes makes things up, why minimal changes to the prompt can completely flip the answer, why it excels at certain tasks and fails at others in ways that would almost be funny if they were not so expensive.
This article lays the foundation. We will look at what a language model does mechanically — step by step, with real code, no shortcuts.
Not Words, but Tokens
The first thing most people get wrong: a language model does not read words. It reads tokens.
A token is a piece of text — sometimes a whole word, sometimes a word fragment, sometimes a single character or a punctuation mark. The exact segmentation depends on the tokenizer, but the principle is the same everywhere: text is split into manageable units before the model ever sees it.
Why not just use words? Three reasons.
First, the concept of a “word” is language-dependent and fuzzy — a German compound like “Donaudampfschifffahrtsgesellschaft” packs several conceptual words into one. Second, a purely word-based vocabulary would be enormous — millions of entries across all languages, conjugations, and technical terms. Tokens enable a compact vocabulary of typically 32,000 to 100,000 entries that can still represent any text. Third, sub-word tokens allow unknown words to be assembled from known parts — the model may see “Quantencomputer” for the first time, but already knows the pieces for “Quanten” and “Computer” separately.
Let’s see what this looks like in practice — using the German sentence “Sprachmodelle sind faszinierend” (“language models are fascinating”), which shows the sub-word splitting nicely:
```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the vocabulary used by GPT-4

text = "Sprachmodelle sind faszinierend."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)}\n")

for token_id in tokens:
    print(f"  {token_id:6d} → '{enc.decode([token_id])}'")
```
```
Text: Sprachmodelle sind faszinierend.
Tokens: [50, 3981, 19122, 1543, 7953, 13]
Count: 6

      50 → 'Spr'
    3981 → 'ach'
   19122 → 'modelle'
    1543 → ' sind'
    7953 → ' faszin'
      13 → 'ierend.'
```
“Sprachmodelle” is split into three parts. “sind” stays as a whole word — but with a leading space as part of the token. This is not a bug; it is a design decision: spaces belong to the following token, not the preceding one.
Every token has a numeric ID. That is all the model sees — a sequence of numbers. No text, no meaning, no grammar. Just numbers.
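Both claims are easy to check. A short sanity sketch with tiktoken — `n_vocab` reports the vocabulary size, and decoding the ID sequence restores the text losslessly:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The vocabulary is finite: every ID from 0 to n_vocab - 1 is a valid token
print(enc.n_vocab)  # 100277

# The round trip text → IDs → text is lossless; the IDs carry
# identity, not meaning
ids = enc.encode("Sprachmodelle sind faszinierend.")
assert enc.decode(ids) == "Sprachmodelle sind faszinierend."
```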
A Distribution, Not an Answer
Here is the core of the matter, and it is simpler than it sounds.
A language model takes a sequence of token IDs as input and produces — for every possible next token — a probability. Not the answer. A probability distribution over all possible continuations.
With a vocabulary of 50,000 tokens, you get 50,000 probabilities. The sum of all probabilities is always 1.0. The model is essentially saying: “Given everything so far — with 34% probability the next token is ‘is’, with 12% ‘was’, with 8% ‘will’…” and so on through the entire vocabulary.
Internally, this happens via a softmax step. Softmax takes arbitrary numbers — the so-called logits, the raw model outputs — and turns them into a probability distribution:
```python
import numpy as np

def softmax(logits):
    # Numerically stable: subtracting the maximum prevents overflow
    logits = logits - np.max(logits)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Simplified example: the model has scored 5 possible next tokens
# In practice there are 50,000+
token_texts = [" ist", " war", " wird", " hat", " kann"]
logits      = [  3.2,   1.8,    2.1,   1.2,    0.5]

probabilities = softmax(np.array(logits))

print(f"{'Token':12s} {'Logit':6s} {'Prob.':10s}")
print("-" * 34)
for text, logit, prob in zip(token_texts, logits, probabilities):
    bar = "█" * int(prob * 40)
    print(f"{text:12s} {logit:5.1f} {prob:.4f} {bar}")
```
```
Token        Logit  Prob.
----------------------------------
 ist           3.2 0.5612 ██████████████████████
 war           1.8 0.1384 █████
 wird          2.1 0.1868 ███████
 hat           1.2 0.0759 ███
 kann          0.5 0.0377 █
```
Two things stand out. First: the logit gap between “ ist” (3.2) and “ war” (1.8) is only 1.4 — but the probabilities land at 56% versus 14%, a factor of four. Softmax amplifies differences exponentially. Small changes in logits lead to large shifts in the distribution.

Second: the model is never 100% certain. Even the clear favorite only gets a bit over half of the mass here. That means: when you sample from this distribution, something other than the most likely token is drawn almost half the time — unless you go deterministic. We will get to that shortly.
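The amplification has a compact form: for softmax, the ratio between two probabilities depends only on the difference between their logits — p_i / p_j = exp(logit_i − logit_j). A small sketch, reusing the numbers from above:

```python
import numpy as np

def softmax(logits):
    logits = np.array(logits) - np.max(logits)
    e = np.exp(logits)
    return e / e.sum()

logits = np.array([3.2, 1.8, 2.1, 1.2, 0.5])
p = softmax(logits)

# The probability ratio is exactly the exponential of the logit gap
print(p[0] / p[1])        # ≈ 4.06  (' ist' vs. ' war')
print(np.exp(3.2 - 1.8))  # ≈ 4.06  — the same number

# Nudge a single logit by +0.5 and the whole distribution moves
p2 = softmax(logits + np.array([0.0, 0.5, 0.0, 0.0, 0.0]))
print(f"' war': {p[1]:.4f} → {p2[1]:.4f}")  # 0.1384 → 0.2094
```

Half a logit — well within what a rephrased prompt can cause — is enough to visibly redistribute the mass. We will come back to this.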
How a Token Is Chosen from the Distribution
The model has computed a distribution. Now it must select a token. How this happens has an enormous impact on the model’s behavior — and is the reason the same question can produce different answers.
Greedy Decoding
The most obvious strategy: always pick the token with the highest probability.
```python
import numpy as np

token_texts = [" ist", " war", " wird", " hat", " kann"]
probabilities = [0.42, 0.18, 0.15, 0.14, 0.11]

def greedy_decode(probabilities, token_texts):
    best_idx = np.argmax(probabilities)
    return token_texts[best_idx]

next_token = greedy_decode(probabilities, token_texts)
print(f"Greedy: '{next_token}'")  # → ' ist'
```
Deterministic, reproducible — and problematic for longer texts in practice. Greedy Decoding tends toward repetitions and flat, predictable outputs. The model always picks the locally best option, which can lead to globally poor results — a classic greedy problem familiar from any algorithms course.
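You can watch this happen without a real model. A toy lookup table stands in for the distribution — the transition probabilities here are invented purely for illustration:

```python
# A toy "model": the last token determines the next-token distribution
toy_model = {
    "the":    {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "cat":    {"sat": 0.4, "chased": 0.35, "end": 0.25},
    "sat":    {"on": 0.6, "end": 0.4},
    "on":     {"the": 0.7, "end": 0.3},
    "chased": {"the": 0.9, "end": 0.1},
}

def greedy_generate(start, steps=8):
    out = [start]
    for _ in range(steps):
        dist = toy_model.get(out[-1])
        if dist is None:
            break
        out.append(max(dist, key=dist.get))  # always the locally best token
    return " ".join(out)

print(greedy_generate("the"))
# → the cat sat on the cat sat on the
```

Greedy locks into the cycle the → cat → sat → on → the and never reaches “end” or “chased”, even though both carry substantial probability. Sampling would break out eventually.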
Temperature Sampling
The more elegant solution: reshape the probability distribution before sampling — using a parameter called Temperature.
```python
import numpy as np
from collections import Counter

def softmax(logits):
    logits = np.array(logits) - np.max(logits)
    e = np.exp(logits)
    return e / e.sum()

def temperature_sample(logits, temperature, token_texts, n_samples=10000):
    # Temperature scales the logits before the softmax
    scaled_logits = np.array(logits) / temperature
    probs = softmax(scaled_logits)
    # Random sampling from the distribution
    chosen_indices = np.random.choice(len(token_texts), size=n_samples, p=probs)
    counts = Counter(chosen_indices)
    print(f"\nTemperature = {temperature}")
    print(f"{'Token':12s} {'Prob.':10s} {'Drawn':8s}")
    print("-" * 38)
    for idx, text in enumerate(token_texts):
        prob = probs[idx]
        drawn = counts.get(idx, 0) / n_samples
        bar = "█" * int(drawn * 30)
        print(f"{text:12s} {prob:.4f} {drawn:.4f} {bar}")

token_texts = [" ist", " war", " wird", " hat", " kann"]
logits      = [  3.2,   1.8,    2.1,   1.2,    0.5]

temperature_sample(logits, temperature=0.2, token_texts=token_texts)
temperature_sample(logits, temperature=1.0, token_texts=token_texts)
temperature_sample(logits, temperature=2.0, token_texts=token_texts)
```
```
Temperature = 0.2
Token        Prob.      Drawn
--------------------------------------
 ist         0.9950 0.9947 █████████████████████████████
 war         0.0009 0.0010
 wird        0.0041 0.0043
 hat         0.0000 0.0000
 kann        0.0000 0.0000

Temperature = 1.0
Token        Prob.      Drawn
--------------------------------------
 ist         0.5612 0.5598 ████████████████
 war         0.1384 0.1391 ████
 wird        0.1868 0.1875 █████
 hat         0.0759 0.0761 ██
 kann        0.0377 0.0375 █

Temperature = 2.0
Token        Prob.      Drawn
--------------------------------------
 ist         0.3703 0.3711 ███████████
 war         0.1839 0.1832 █████
 wird        0.2136 0.2147 ██████
 hat         0.1362 0.1355 ████
 kann        0.0960 0.0955 ██
```
At Temperature 0.2, one token dominates almost completely — the behavior approaches greedy. At Temperature 2.0, the distribution flattens visibly toward uniform — the output gets more random, more creative, but also more error-prone. Temperature 1.0 leaves the model’s original distribution untouched.
Top-p Sampling
An extension that is often combined with Temperature in practice: Top-p — also known as Nucleus Sampling. Instead of including all tokens in the pool, Top-p selects only the most probable tokens — just enough until their cumulative probability reaches p.
```python
import numpy as np

def softmax(logits):
    logits = np.array(logits) - np.max(logits)
    e = np.exp(logits)
    return e / e.sum()

def top_p_sample(probabilities, token_texts, p=0.9):
    probabilities = np.array(probabilities)
    # Sort tokens by probability, descending
    sorted_indices = np.argsort(probabilities)[::-1]
    sorted_probs = probabilities[sorted_indices]
    # Cumulative sum — up to the point where p is reached
    cumulative_probs = np.cumsum(sorted_probs)
    cutoff_idx = np.searchsorted(cumulative_probs, p) + 1
    nucleus_indices = sorted_indices[:cutoff_idx]
    nucleus_probs = sorted_probs[:cutoff_idx]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()  # renormalize
    print(f"\nTop-p = {p} → {cutoff_idx} tokens in the nucleus:")
    for idx, prob in zip(nucleus_indices, nucleus_probs):
        print(f"  {token_texts[idx]:12s} {prob:.4f}")
    return token_texts[np.random.choice(nucleus_indices, p=nucleus_probs)]

token_texts = [" ist", " war", " wird", " hat", " kann"]
logits      = [  3.2,   1.8,    2.1,   1.2,    0.5]

probs = softmax(logits)
result = top_p_sample(probs, token_texts, p=0.9)
print(f"\nDrawn: '{result}'")
```
```
Top-p = 0.9 → 4 tokens in the nucleus:
   ist          0.5832
   wird         0.1941
   war          0.1438
   hat          0.0789

Drawn: ' ist'
```

With these numbers, only “ kann” falls outside the nucleus — its probability mass lies beyond the 0.9 cutoff.
The advantage over pure Temperature Sampling: the nucleus size adapts dynamically. When the model is very confident and one token has 95%, the nucleus is minimal. When the model is uncertain, the nucleus grows — but not without limits. Unlikely outliers are always excluded.
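This adaptivity is easy to make visible. A minimal sketch with two invented distributions — one where the model is confident, one where it is not:

```python
import numpy as np

def nucleus_size(probabilities, p=0.9):
    sorted_probs = np.sort(probabilities)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # one token carries the mass
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # the mass is spread out

print(nucleus_size(confident))  # 1 — the nucleus collapses to one token
print(nucleus_size(uncertain))  # 4 — wider, but the 0.10 outlier is still cut
```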
Strategies at a Glance
| Strategy | Deterministic | Creativity | Typical Use Case |
|---|---|---|---|
| Greedy (temp = 0) | Yes | None | Factual queries, code completion |
| Temperature < 1 | No | Low | Structured outputs, JSON |
| Temperature = 1 | No | Medium | General chatbots |
| Temperature > 1 | No | High | Creative writing, brainstorming |
| Top-p = 0.9 | No | Context-dependent | Default in most APIs |
In practice, almost all production systems use a combination: Temperature between 0.7 and 1.0, combined with Top-p between 0.9 and 0.95. Most API defaults sit squarely in this range.
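Put together, such a sampler fits in a few lines. A sketch of one plausible combination — real inference stacks differ in details, for example in the order in which the filters are applied:

```python
import numpy as np

def sample(logits, temperature=0.8, top_p=0.9):
    # 1. Temperature reshapes the distribution
    logits = np.array(logits) / temperature
    logits -= np.max(logits)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # 2. Top-p keeps the smallest token set whose mass reaches p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    # 3. Sample from the truncated, renormalized distribution
    return np.random.choice(nucleus, p=nucleus_probs)

token_texts = [" ist", " war", " wird", " hat", " kann"]
logits = [3.2, 1.8, 2.1, 1.2, 0.5]
print(token_texts[sample(logits)])  # usually ' ist', sometimes ' wird' or ' war'
```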
What All of This Explains
Now a number of behaviors that previously seemed like black magic can be explained.
Hallucinations are not a bug — they are a direct consequence of the sampling process. The model chooses from a probability distribution at every step. If the correct token sits at 60% and an incorrect but plausible token at 15%, the wrong one gets picked roughly one in six times — and then serves as input for the next step, shifting the subsequent distribution. Errors accumulate across the sequence.
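The compounding is easy to quantify in a simplified form. Assume — purely for illustration, and ignoring that real errors feed back into the context — that each step independently produces an acceptable token with 94% probability:

```python
# Toy calculation: chance that an entire sequence stays error-free,
# assuming independent per-step success (a deliberate simplification)
p_step = 0.94
for n in [10, 50, 200]:
    print(f"{n:4d} tokens → {p_step**n:.1%} chance of no misstep")
# 10 tokens → 53.9%, 50 tokens → 4.5%, 200 tokens → 0.0%
```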
Prompt sensitivity is explained by the fact that minimal changes in the input text shift the logits — and softmax turns small logit differences into large probability differences. A different word in the prompt can produce an entirely different distribution.
Non-determinism is fundamental at Temperature > 0. The same model gives different answers to the same question — that is not a bug, that is the design. Reproducible outputs require Temperature 0 and a fixed random seed.
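Locally, the seed part is a one-liner — here on the toy distribution from above, where a fixed seed makes every run identical:

```python
import numpy as np

probs = [0.5612, 0.1384, 0.1868, 0.0759, 0.0377]

# Two generators with the same seed draw exactly the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print(np.array_equal(rng_a.choice(5, size=10, p=probs),
                     rng_b.choice(5, size=10, p=probs)))  # True
```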
Long outputs drift because every generated token becomes part of the next input. Errors early in the output influence all subsequent tokens. This is why structured output and JSON mode exist in many APIs — they constrain the sampling space so the model does not deviate from the desired format.
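Constraining the sampling space sounds abstract but is mechanically simple: tokens that would violate the format get their logits set to −∞ before softmax, so they receive exactly zero probability. A miniature sketch — which tokens count as “allowed” would in reality come from a grammar or JSON schema; here it is just a hand-picked mask:

```python
import numpy as np

def softmax(logits):
    logits = np.array(logits, dtype=float) - np.max(logits)
    e = np.exp(logits)
    return e / e.sum()

logits  = np.array([3.2, 1.8, 2.1, 1.2, 0.5])
allowed = np.array([True, False, True, False, False])  # hand-picked mask

# Forbidden tokens get logit -inf → probability exactly 0 after softmax
masked = np.where(allowed, logits, -np.inf)
print(softmax(masked))  # ≈ [0.75, 0, 0.25, 0, 0] — mass only on allowed tokens
```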
What Is Still Missing
At this point, we have described what a language model does: it takes a sequence of token IDs, computes a probability distribution over the entire vocabulary, and selects the next token.
What we have not described: how does the model arrive at these probabilities in the first place? What happens between the token ID sequence and the logits?
That is the question of internal representation — and the answer begins with a problem. The model sees only numbers: token 19122 is just as abstract to the model as token 3981. To compute meaningful probabilities, the model needs a way to encode meaning into these numbers. It needs a way to know that “cat” and “dog” have more in common than “cat” and “highway”.
That is the topic of Article 2: Embeddings — how tokens are projected into a high-dimensional space where meaning becomes measurable.
All Articles in the Series
- The Next Word — How Language Models Work ← this article
- Words as Points in Space — What Embeddings Really Are
- Neural Networks from Scratch (coming soon)
- Backpropagation — How a Model Learns (coming soon)
- Context and RNNs — Why Order Matters (coming soon)
- Attention — The Mechanism That Changed Everything (coming soon)
- The Transformer — The Complete Architecture (coming soon)
- Fine-Tuning — From Base Model to Assistant (coming soon)
Series: How LLMs Really Work · rotecodefraktion.de
Translated with the help of Claude