Words as Points in Space — What Embeddings Are

Article 2 of 8 · Series: How LLMs Work

At the end of Article 1, we left an open problem.

A language model sees only token IDs — numbers. Token 19122 stands for “modelle”, token 3981 for “ach”. For the model, these are initially two equally abstract symbols, just as the number 42 has no inherent relationship to the number 43, even though they sit next to each other.

The problem: language is not like that. “Cat” and “dog” have a lot in common. “Cat” and “highway” have little in common. “King” and “queen” share almost everything except one concept. A model that does not know these relationships cannot compute meaningful probabilities.

The solution is called an embedding — and it is more elegant than you might expect.

The Core Idea: Meaning as Position

Imagine a map. Cities that are geographically close to each other often share properties — climate, language, culture, history. Position on the map implicitly encodes information.

Embeddings do the same thing with words — only in a space with not two, but typically 512, 1,024, or 4,096 dimensions.

Every token gets a vector — a list of numbers — that describes its position in this high-dimensional space. Tokens with similar meaning end up close together. Tokens with little in common end up far apart.

"Katze"   → [0.82, -0.41,  0.67,  0.23, ...]  # 512 Zahlen
"Hund"    → [0.79, -0.38,  0.71,  0.19, ...]  # ähnlich
"Autobahn"→ [0.12,  0.55, -0.30, -0.44, ...]  # sehr anders

The individual numbers have no direct human interpretation — dimension 7 does not mean “living being: yes/no”. Meaning emerges from the interplay of all dimensions, and it is learned during training, not programmed by hand.

Measuring Similarity: Cosine Similarity

If words are points in space, we need a method to measure how close they are to each other. The obvious idea — Euclidean distance, the straight-line distance between two points — works poorly in practice. Why?

Because the length of a vector depends on how frequently the token appeared during training, not on its meaning. Common words like “the”, “a”, “is” tend to have longer vectors. Distance would factor this frequency into the similarity calculation — that is not what we want.
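
The effect is easy to see with two made-up vectors (a minimal sketch, not real embeddings): they point in exactly the same direction, but one is three times as long — roughly the situation a rare and a frequent token with similar meaning could produce.

import numpy as np

# Hypothetical embeddings: same direction, different lengths
selten  = np.array([0.3, 0.4])    # short vector, norm 0.5
haeufig = np.array([0.9, 1.2])    # same direction, three times as long

euklidisch = np.linalg.norm(selten - haeufig)
kosinus    = np.dot(selten, haeufig) / (np.linalg.norm(selten) * np.linalg.norm(haeufig))

print(f"Euclidean distance: {euklidisch:.2f}")  # 1.00 — looks far apart
print(f"Cosine similarity:  {kosinus:.2f}")     # 1.00 — identical direction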

The solution is Cosine Similarity: instead of measuring the distance between two points, we measure the angle between two vectors. Length does not matter — only direction counts.

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Simplified example embeddings (normally 512+ dimensions)
katze    = np.array([ 0.82, -0.41,  0.67,  0.23,  0.55])
hund     = np.array([ 0.79, -0.38,  0.71,  0.19,  0.51])
autobahn = np.array([ 0.12,  0.55, -0.30, -0.44,  0.08])
koenig   = np.array([ 0.50,  0.80,  0.10,  0.70, -0.20])
koenigin = np.array([ 0.48,  0.75,  0.12,  0.65, -0.18])

paare = [
    ("Katze",    "Hund",     katze,    hund),
    ("Katze",    "Autobahn", katze,    autobahn),
    ("König",    "Königin",  koenig,   koenigin),
    ("König",    "Autobahn", koenig,   autobahn),
]

print(f"{'Paar':30s}  {'Cosine Similarity':>18s}")
print("-" * 52)
for name_a, name_b, vec_a, vec_b in paare:
    sim = cosine_similarity(vec_a, vec_b)
    bar = "█" * int(sim * 30)
    label = f"{name_a} ↔ {name_b}"
    print(f"{label:30s}  {sim:8.4f}  {bar}")
Pair                             Cosine Similarity
----------------------------------------------------
Katze ↔ Hund                          0.9982  █████████████████████████████
Katze ↔ Autobahn                     -0.3856
König ↔ Königin                       0.9996  █████████████████████████████
König ↔ Autobahn                      0.1567  ████

Cosine Similarity — interactive

[Interactive demo in the original article: adjust the angle and the lengths of two vectors — cos(θ) depends only on the angle, not on the lengths.]

A value close to 1.0 means very similar direction — the tokens are semantically close. A value near 0 means no particular relationship. Negative values are possible and indicate roughly opposite directions in the vector space. One caveat: opposite direction is not the same as opposite meaning. Antonyms like "hot" and "cold" appear in very similar contexts and therefore often end up close together in trained embeddings.

The Embedding Table

Technically, all of a model's embeddings live in one large matrix — the embedding table. It has as many rows as the vocabulary has tokens, and as many columns as the embedding dimension:

import numpy as np

VOCAB_SIZE = 50000      # number of tokens in the vocabulary
EMBED_DIM  = 512        # embedding dimension

# The embedding table: a matrix with 50,000 × 512 parameters
# (randomly initialized — the values differ from run to run)
embedding_table = np.random.randn(VOCAB_SIZE, EMBED_DIM) * 0.02

# Token lookup: token ID → embedding vector
def get_embedding(token_id):
    return embedding_table[token_id]  # simple row access

# Example
token_id = 19122  # "modelle"
embedding = get_embedding(token_id)

print(f"Token ID:        {token_id}")
print(f"Embedding shape: {embedding.shape}")
print(f"First 8 values:  {embedding[:8].round(4)}")
print(f"\nTable size: {VOCAB_SIZE * EMBED_DIM:,} parameters")
print(f"            = {VOCAB_SIZE * EMBED_DIM * 4 / 1e6:.1f} MB (float32)")
Token ID:        19122
Embedding shape: (512,)
First 8 values:  [ 0.0142 -0.0089  0.0231  0.0056 -0.0178  0.0094  0.0167 -0.0203]

Table size: 25,600,000 parameters
            = 102.4 MB (float32)

The embedding table of this mid-sized example alone holds 25.6 million parameters. GPT-3 has 175 billion parameters in total — its embedding table accounts for a comparatively small fraction. The bulk of the parameters reside in the layers that operate on top of the embeddings — we will cover those in later articles.
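
For a sense of scale, a quick back-of-the-envelope calculation with the values reported for GPT-3 (a vocabulary of 50,257 tokens and an embedding width of 12,288):

vocab_size = 50_257     # GPT-3 vocabulary size
embed_dim  = 12_288     # GPT-3 embedding dimension
total      = 175e9      # total parameter count

embedding_params = vocab_size * embed_dim
print(f"Embedding table: {embedding_params / 1e6:.0f} million parameters")
print(f"Share of total:  {embedding_params / total:.2%}")
# → ~618 million parameters, roughly 0.35% of the model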

What the Model Actually Learns

The embedding table is randomly initialized at the beginning of training — as in the code example above. The numbers are meaningless. Meaning only emerges through training.

As the model predicts text, it learns through backpropagation not just the weights of the later layers, but also the embedding table itself. When the model repeatedly sees that “cat” and “dog” appear in similar contexts — both as pets, both associated with food, veterinarians, petting — their embedding vectors are gradually pulled in similar directions.

This happens without explicit instruction. No human specifies that “cat” and “dog” should be similar. The model discovers this relationship on its own — from the statistical structure of language.

We will examine backpropagation in detail in Article 4. For now, the picture is enough: the embedding table is a learnable parameter that gets adjusted through training so that the model makes the best possible predictions.
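
To make "learnable parameter" concrete, here is a schematic sketch of a single update step — not a real training loop, and the gradient values are invented. The key detail: in the lookup step, only the rows of tokens that actually occurred in the input receive a gradient; every other row stays untouched.

import numpy as np

np.random.seed(0)
tabelle = np.random.randn(10, 4) * 0.02   # mini embedding table: 10 tokens, 4 dimensions

# Hypothetical gradient from backpropagation for an input
# that contained tokens 3 and 7 — all other rows get zero
gradient = np.zeros_like(tabelle)
gradient[3] = np.array([ 0.10, -0.20,  0.05,  0.00])
gradient[7] = np.array([-0.05,  0.10, -0.10,  0.20])

lernrate = 0.01
tabelle -= lernrate * gradient   # gradient descent step: only rows 3 and 7 move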

Emergent Structure: King - Man + Woman = Queen

The most famous example of what embeddings contain: vector arithmetic.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Real embeddings from a trained model would show:
# embedding("König") - embedding("Mann") + embedding("Frau") ≈ embedding("Königin")

# We simulate this with hand-set vectors that illustrate the principle
# Dimensions: [royalty, femininity, power, age]
mann     = np.array([0.1,  0.0,  0.5,  0.5])
frau     = np.array([0.1,  1.0,  0.5,  0.5])
koenig   = np.array([0.9,  0.0,  0.9,  0.7])
koenigin = np.array([0.9,  1.0,  0.9,  0.7])

# Vector arithmetic
ergebnis = koenig - mann + frau

print("König - Mann + Frau =", ergebnis.round(3))
print("Königin             =", koenigin.round(3))
print()

# Which token is closest to the result?
kandidaten = {
    "Mann":     mann,
    "Frau":     frau,
    "König":    koenig,
    "Königin":  koenigin,
}

print(f"{'Token':12s}  {'Cosine Similarity zum Ergebnis':>32s}")
print("-" * 48)
for name, vec in kandidaten.items():
    sim = cosine_similarity(ergebnis, vec)
    bar = "█" * int(sim * 25)
    print(f"{name:12s}  {sim:8.4f}  {bar}")
König - Mann + Frau = [0.9 1.  0.9 0.7]
Königin             = [0.9 1.  0.9 0.7]

Token              Cosine similarity to result
------------------------------------------------
Mann            0.7067  █████████████████
Frau            0.8722  █████████████████████
König           0.8237  ████████████████████
Königin         1.0000  █████████████████████████

Vector arithmetic — interactive

[Interactive demo in the original article: Mann, Frau, König and Königin plotted on femininity/royalty axes, with the difference vectors shown.]

What happens here is structurally remarkable: the model has implicitly learned that the difference between “King” and “Queen” is the same as the difference between “Man” and “Woman” — namely gender. This information was never explicitly encoded. It emerged from the frequency with which these words appear in similar and different contexts.

In real trained embeddings, this structure is found not just for gender, but for dozens of concepts: singular/plural, past/present, capital/country, positive/negative. The vector space is not randomly organized — it mirrors the conceptual structure of language.
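
The same check works for other concepts. A toy sketch with hand-set vectors (in real models this structure emerges only through training): the offset between a singular and its plural form is approximately the same across word pairs — a shared "plural direction".

import numpy as np

# Hand-set toy vectors; the second dimension encodes "plural-ness"
katze  = np.array([0.8, 0.1, 0.3])
katzen = np.array([0.8, 0.9, 0.3])
hund   = np.array([0.6, 0.1, 0.5])
hunde  = np.array([0.6, 0.9, 0.5])

richtung_1 = katzen - katze   # [0.0, 0.8, 0.0]
richtung_2 = hunde - hund     # [0.0, 0.8, 0.0]

print(np.allclose(richtung_1, richtung_2))   # True — the same plural direction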

In Practice: Computing and Visualizing Embeddings

With modern libraries, we do not need to train our own model. sentence-transformers provides pre-trained embedding models that are ready to use:

# pip install sentence-transformers scikit-learn matplotlib
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np

# Load the pre-trained model (downloaded on the first call, ~90 MB)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Words and phrases we want to examine
texte = [
    # animals
    "Katze", "Hund", "Vogel", "Fisch",
    # vehicles
    "Auto", "Fahrrad", "Zug", "Flugzeug",
    # emotions
    "Freude", "Trauer", "Wut", "Angst",
    # technology
    "Computer", "Software", "Algorithmus", "Datenbank",
]

farben = (
    ["#7F77DD"] * 4 +   # animals: purple
    ["#1D9E75"] * 4 +   # vehicles: green
    ["#E85D24"] * 4 +   # emotions: orange
    ["#185FA5"] * 4     # technology: blue
)

# Compute embeddings — each text becomes a 384-dimensional vector
embeddings = model.encode(texte)
print(f"Embedding shape: {embeddings.shape}")  # (16, 384)

# PCA: 384 dimensions → 2 dimensions for the visualization
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
print(f"Explained variance from 2 components: {pca.explained_variance_ratio_.sum():.1%}")

# Visualization
fig, ax = plt.subplots(figsize=(10, 8))

for i, (text, farbe) in enumerate(zip(texte, farben)):
    x, y = embeddings_2d[i]
    ax.scatter(x, y, color=farbe, s=100, zorder=2)
    ax.annotate(text, (x, y), textcoords="offset points", xytext=(8, 4), fontsize=11)

legende = [
    Patch(color="#7F77DD", label="Animals"),
    Patch(color="#1D9E75", label="Vehicles"),
    Patch(color="#E85D24", label="Emotions"),
    Patch(color="#185FA5", label="Technology"),
]
ax.legend(handles=legende, loc="upper right")
ax.set_title("Embeddings visualized — similar concepts cluster together")
ax.set_xlabel(f"PCA component 1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
ax.set_ylabel(f"PCA component 2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("embeddings_visualisiert.png", dpi=150)
plt.show()
Embedding shape: (16, 384)
Explained variance from 2 components: 68.3%

Vector-space clusters — interactive

[Interactive demo in the original article: the 16 words plotted on PCA components 1 and 2, colored by category.]

What the plot shows: animals cluster together, vehicles cluster together, emotions cluster together, technology terms cluster together — even though the model was never told what these categories are. It extracted this conceptual structure from its training data.

PCA reduces 384 dimensions down to 2 — information is lost in the process. 68% explained variance means the visualization gives a good but incomplete picture. In the full 384-dimensional space, the clusters are even more clearly separated.
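
If you want to quantify what the projection misses, you can ask PCA how many components are needed for a given share of the variance — a small follow-up to the script above, reusing its embeddings array:

import numpy as np
from sklearn.decomposition import PCA

pca_voll = PCA().fit(embeddings)   # all components (here at most 16, the number of samples)
kumulativ = np.cumsum(pca_voll.explained_variance_ratio_)
n_90 = int(np.searchsorted(kumulativ, 0.90)) + 1
print(f"Components needed for 90% of the variance: {n_90}")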

Cosine Similarity in Practice

With real embeddings, you can get to work immediately:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Semantic search: which sentence best matches the query?
anfrage = "Wie funktioniert maschinelles Lernen?"

dokumente = [
    "Neuronale Netze lernen durch Anpassung von Gewichten.",
    "Das Wetter in München ist heute sonnig.",
    "Gradient Descent minimiert die Verlustfunktion.",
    "Die Bundesliga startet in die neue Saison.",
    "Backpropagation berechnet Gradienten durch die Schichten.",
    "Ein gutes Rezept braucht frische Zutaten.",
]

anfrage_emb  = model.encode(anfrage, convert_to_tensor=True)
dokument_emb = model.encode(dokumente, convert_to_tensor=True)

# Cosine similarity between the query and all documents
scores = util.cos_sim(anfrage_emb, dokument_emb)[0]

# Sorted by relevance
ranking = sorted(zip(scores.tolist(), dokumente), reverse=True)

print(f"Anfrage: '{anfrage}'\n")
print(f"{'Score':8s}  Dokument")
print("-" * 70)
for score, doc in ranking:
    print(f"{score:6.4f}   {doc}")
Query: 'Wie funktioniert maschinelles Lernen?'

Score     Document
----------------------------------------------------------------------
0.7823   Neuronale Netze lernen durch Anpassung von Gewichten.
0.7541   Backpropagation berechnet Gradienten durch die Schichten.
0.7218   Gradient Descent minimiert die Verlustfunktion.
0.1834   Ein gutes Rezept braucht frische Zutaten.
0.1621   Das Wetter in München ist heute sonnig.
0.1204   Die Bundesliga startet in die neue Saison.

This is the core of semantic search, RAG systems, and many other AI applications — no keyword matching, but meaning comparison in vector space.
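
To make this reusable, the comparison can be wrapped in a small top-k retrieval function — a sketch using util.semantic_search from sentence-transformers, with the model and the dokumente list from above:

from sentence_transformers import util

def suche(anfrage, dokumente, dokument_emb, model, top_k=3):
    """Return the top_k most relevant documents for a query."""
    anfrage_emb = model.encode(anfrage, convert_to_tensor=True)
    treffer = util.semantic_search(anfrage_emb, dokument_emb, top_k=top_k)[0]
    return [(t["score"], dokumente[t["corpus_id"]]) for t in treffer]

# Usage with the documents from above:
# for score, doc in suche("Was ist Backpropagation?", dokumente, dokument_emb, model):
#     print(f"{score:.4f}  {doc}")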

The Problem That Remains

Embeddings are an enormous advancement over raw token IDs. But they have a fundamental weakness: they are static.

Every token has exactly one embedding vector — regardless of the context in which it appears. The word “bank” has the same embedding whether it refers to a park bench or a financial institution. The word “strike” has the same embedding in “he strikes the ball” and “the workers strike”.

But language is context-dependent. The meaning of “bank” depends on everything that comes before and after it. A static embedding cannot capture that.

What we need is a mechanism that transforms embeddings depending on context — that turns the static embedding of “bank” into a contextual embedding that looks different in a financial context than in a park.
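
Expressed as function signatures, the difference looks like this — a sketch only, reusing the embedding_table lookup from earlier; the mechanism that fills in the second function is the subject of the coming articles:

# Static (this article): context does not even appear in the signature
def static_embedding(token_id):
    return embedding_table[token_id]          # always the same row

# What we need (Articles 3–7): the surrounding tokens shape the result
def contextual_embedding(token_id, kontext_token_ids):
    ...   # neural networks, attention, transformers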

That is the job of neural networks — and that is where Article 3 begins.

All Articles in the Series

  1. The Next Word — How Language Models Work
  2. Words as Points in Space — What Embeddings Are ← this article
  3. Neural Networks from Scratch (coming soon)
  4. Backpropagation — How a Model Learns (coming soon)
  5. Context and RNNs — Why Order Matters (coming soon)
  6. Attention — The Mechanism That Changed Everything (coming soon)
  7. The Transformer — The Complete Architecture (coming soon)
  8. Fine-Tuning — From Base Model to Assistant (coming soon)

Series: How LLMs Work · rotecodefraktion.de


Translated with the help of Claude