Test Data, Because Real Data Is Off the Table

Article 3 · Series: Getting Started with n8n

The demo project classifies support tickets. That requires tickets — realistic ones, in sufficient quantity, with a known category distribution. Real tickets from a production system are out for a public repo: they contain personal data, system configurations, document numbers, sometimes even passwords in plain text. That is not a compliance disclaimer performed out of obligation. It is a genuine content problem. The dataset for this article is on Codeberg, tag v0.3: codeberg.org/rotecodefraktion/n8n-einstieg.

The Options Compared

Four candidates exist when real data is unavailable:

Option | Problem
Public customer support datasets | Almost exclusively B2C telecom data in English; no SAP context, no DACH language
GitHub Issues as a source | Legally grey (poster usage rights), wrong domain, already classified
Pure LLM generation without structure | Prompting “generate 500 support tickets” produces 500 variations of the same ticket
Hybrid: LLM with categories and personas | Variety comes from structure, not from chance

The fourth option is the one we use. The key is the distinction between variety and randomness. An LLM generating tickets without context reproduces its own bias. Persona and category as explicit inputs enforce diversity — tone, subject area, and urgency are controlled parameters, not a matter of hope.

Categories × Personas

The generator works with six categories and five personas. Each combination produces a different ticket.

Categories:

ID | Topic | Keywords
sap-basis | System operations, transport, patches | SM50, STMS, kernel update, ABAP dump
sap-functional | FI, MM, SD, HR, processes | Accounting document, purchase order, delivery
infrastruktur | Server, network, backup, storage | RAID, NFS, latency, backup failed
cloud | Azure, Kubernetes, Terraform | Pod crash, subscription limit, state drift
security-pki | Certificates, permissions, CVE | Certificate expired, SU53, penetration test
sonstiges | Everything else |

Personas:

ID | Profile | Tone
end-user-frustrated | Frustrated end user | Short sentences, no jargon, exclamation marks
key-user-urgent | Urgent key user | Much context, SAP transactions, time pressure
admin-precise | Precise admin | Error codes, calm tone, complete details
manager-vague | Vague manager | Little detail, high urgency
external-reporter | External reporter | Formal, incomplete information

Three sample tickets from the dataset to show the interaction:

sap-basis / de / end-user-frustrated:

“SAP login not possible after password reset. After the reset by IT I can no longer access the system. Error message: SM50 reports connection failure. This is completely blocking my work!”

sap-basis / de / key-user-urgent:

“Transport request failing in STMS. Transport DEVK900123 cannot be imported. Error in SM37: job RDDIMPDP terminates with dump. Patch level difference between DEV and QAS could be the cause.”

sap-basis / en / admin-precise:

“Kernel update causing SM50 work process restarts. After patch level upgrade to 7.93, work processes restart every 30 minutes. RFC connections drop simultaneously.”

Same category, three different tickets. The persona determines how someone writes, not what the problem is.
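
How category and persona act as explicit inputs can be illustrated with a small sketch. It is not the generator's actual code: `build_jobs`, the job fields, and the uniform cycling through all 30 combinations are assumptions made for illustration, and the category weights of the real dataset are ignored here. The category IDs, persona IDs, and keywords are taken from the tables above.

```python
import itertools
import random

# Category IDs with a few keywords each, copied from the category table.
CATEGORIES = {
    "sap-basis": ["SM50", "STMS", "kernel update", "ABAP dump"],
    "sap-functional": ["accounting document", "purchase order", "delivery"],
    "infrastruktur": ["RAID", "NFS", "latency", "backup failed"],
    "cloud": ["pod crash", "subscription limit", "state drift"],
    "security-pki": ["certificate expired", "SU53", "penetration test"],
    "sonstiges": [],
}
PERSONAS = [
    "end-user-frustrated",
    "key-user-urgent",
    "admin-precise",
    "manager-vague",
    "external-reporter",
]


def build_jobs(n: int, seed: int) -> list:
    """Return n generation jobs, cycling through all category/persona pairs.

    Hypothetical helper: the real generator may weight categories differently.
    """
    rng = random.Random(seed)  # local RNG, so the same seed yields the same jobs
    pairs = list(itertools.product(CATEGORIES, PERSONAS))
    rng.shuffle(pairs)
    jobs = []
    for i in range(n):
        category, persona = pairs[i % len(pairs)]
        keywords = CATEGORIES[category]
        jobs.append({
            "id": f"TKT-{i + 1:04d}",
            "category": category,
            "persona": persona,
            # At most two keywords per ticket keeps the prompts varied.
            "keywords": rng.sample(keywords, k=min(2, len(keywords))),
        })
    return jobs
```

Each job then becomes one prompt. Because the random generator is created from the seed inside the function, two runs with the same seed produce the same job list.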

Local Model for the Bulk

About 85 percent of tickets are generated with a local model via Ollama. This saves API costs, creates no cloud dependency, and runs on Apple Silicon with Qwen 2.5 7B at around 30 tickets per minute.

The adapter in llm_local.py sends category, persona system prompt, and a selection of keywords to the Ollama REST API. Before that, a small routing check decides whether a ticket stays local or counts as a hard case:

def is_hard_case(category: str, persona: str) -> bool:
    hard_categories = {"security-pki"}
    hard_personas = {"external-reporter"}
    return category in hard_categories or persona in hard_personas
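
The call to Ollama itself could look roughly like the following sketch. It uses Ollama's native /api/generate endpoint with the documented fields model, system, prompt, stream, and options. The function names `build_ollama_request` and `generate_local`, the temperature, and the timeout are assumptions and not taken from llm_local.py.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ollama_request(system: str, prompt: str, seed: int,
                         model: str = "qwen2.5:7b") -> urllib.request.Request:
    """Build the POST request for one ticket (hypothetical helper)."""
    payload = {
        "model": model,
        "system": system,   # persona description goes into the system prompt
        "prompt": prompt,   # category and keywords go into the user prompt
        "stream": False,    # one JSON object instead of a token stream
        "options": {"temperature": 0.7, "seed": seed},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_local(system: str, prompt: str, seed: int,
                   timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    req = build_ollama_request(system, prompt, seed)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Separating request construction from the network call keeps the payload testable without a running Ollama instance.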

Anyone without Ollama installed can start the generator with --no-local. Claude then handles all cases, or one simply uses the pinned dataset without regeneration.

For Apple Silicon users, MLX is an alternative to Ollama. llm_local.py contains a commented-out section that calls mlx_lm.generate(). Enabling it in place of the Ollama call is all that is needed; the rest of the generator stays unchanged.

Hummingbird gateway as a unified backend

The Hummingbird gateway from the Hummingbird series on rotecodefraktion.de is a Swift-based LLM proxy that runs on Apple Silicon and exposes both the Anthropic format (/v1/messages) and the OpenAI format (/v1/chat/completions). Anyone running the gateway can operate the entire generator without direct Ollama or Anthropic API calls.

Switching llm_local.py to the gateway:

Instead of http://localhost:11434/api/generate (Ollama native API), use the gateway’s OpenAI-compatible endpoint:

import json
import urllib.request

GATEWAY_BASE = "http://localhost:8080"
DEFAULT_MODEL = "mlx-community/Qwen2.5-7B-Instruct-4bit"

# system, prompt, and ticket_id come from the surrounding adapter code.
# POST /v1/chat/completions instead of /api/generate
payload = json.dumps({
    "model": DEFAULT_MODEL,
    "messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt}
    ],
    "temperature": 0.7,
    "seed": ticket_id  # per-ticket seed for reproducible output
}).encode()
req = urllib.request.Request(
    f"{GATEWAY_BASE}/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer local"}
)

Switching llm_claude.py to the gateway:

The Anthropic Python SDK reads ANTHROPIC_BASE_URL from the environment. Point the variable at the gateway and the “Claude” calls also go through the local model:

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=local   # any string, gateway checks its own token

The gateway handles the format translation: it receives the Anthropic request, converts it to OpenAI format, and forwards it to mlx_lm.server or Ollama.
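
To make the translation step concrete, here is a minimal Python sketch of the mapping from an Anthropic-style request to an OpenAI-style one. It is an illustration only: the gateway itself is written in Swift, the function name `anthropic_to_openai` is hypothetical, and the sketch covers text content only, not images or tool calls.

```python
def _flatten(content) -> str:
    """Anthropic content is either a string or a list of content blocks."""
    if isinstance(content, str):
        return content
    return "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )


def anthropic_to_openai(request: dict) -> dict:
    """Convert a /v1/messages body into a /v1/chat/completions body."""
    messages = []
    # Anthropic keeps the system prompt at the top level,
    # OpenAI expects it as the first message.
    system = request.get("system")
    if system:
        messages.append({"role": "system", "content": _flatten(system)})
    for msg in request.get("messages", []):
        messages.append({"role": msg["role"], "content": _flatten(msg["content"])})

    out = {"model": request["model"], "messages": messages}
    # Sampling parameters that share a name in both formats.
    for key in ("max_tokens", "temperature", "top_p"):
        if key in request:
            out[key] = request[key]
    # The stop parameter is named differently in the two formats.
    if "stop_sequences" in request:
        out["stop"] = request["stop_sequences"]
    return out
```

The response has to be translated in the opposite direction before it goes back to the Anthropic SDK; that part is omitted here.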

Result: One local endpoint for all LLM calls, no cloud dependency, logging and rate limiting in the gateway, no Anthropic API key required. The generator behaves identically from the outside.

Claude for the Hard Cases

Two routing criteria send a ticket to Claude instead of the local model:

security-pki: Certificate terminology, CVE numbers, SU53 error patterns — 7B models frequently produce plausible-sounding but technically incorrect formulations here. Claude delivers noticeably denser texts in this category.

external-reporter: Formal ambiguity and deliberately incomplete information are hard for small models to imitate. The ticket should read as if written by someone who does not know what an SAP system is, yet still maintains a formal tone.

The llm_claude.py adapter requests structured JSON output:

prompt = (
    f"Write a support ticket (subject + 2-4 sentences) about {cat['label_en']}. "
    f"Priority: {priority}. Use: {', '.join(kw_sample)}. "
    f"Respond as JSON: {{\"subject\": \"...\", \"body\": \"...\"}}"
)

This makes the Claude output directly parseable without relying on regex. About 15 percent of tickets are generated this way. With 500 tickets, that is roughly 75 Claude calls, which costs under one dollar with claude-sonnet-4-6.
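
Parsing such a reply can be sketched as follows. The function name `parse_ticket_reply` is an assumption, not the adapter's actual code; the sketch only shows that `json.loads` plus two type checks is enough when the model sticks to the requested format.

```python
import json
from typing import Optional


def parse_ticket_reply(raw: str) -> Optional[dict]:
    """Return {'subject', 'body'} from a JSON reply, or None if it is unusable."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None  # the model answered with something other than JSON
    if not isinstance(data, dict):
        return None
    subject, body = data.get("subject"), data.get("body")
    if not isinstance(subject, str) or not isinstance(body, str):
        return None
    return {"subject": subject.strip(), "body": body.strip()}
```

A reply that fails to parse is treated like any other invalid ticket and can simply be regenerated.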

Validator as Quality Layer

Not every generated ticket is usable. The validator in validator.py checks:

  • Required fields present and non-empty
  • Category, priority, language within allowed value ranges
  • Subject at least 5 characters, body at least 10 characters
  • No generation artifacts in the subject: JSON fragments, prompt phrases such as “Write a” or “Respond as”

A ticket that does not pass the validator is discarded and the generator produces a replacement. In practice, this affects fewer than three percent of local outputs and under one percent of Claude outputs.
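
The checks listed above can be sketched in a few lines. This is not the content of validator.py: the list of required fields, the exact artifact markers, and the helper `generate_until_valid` are assumptions. The allowed values are taken from the category table, the language labels, and the priority levels named in this article.

```python
from typing import Callable, Optional

VALID_CATEGORIES = {
    "sap-basis", "sap-functional", "infrastruktur",
    "cloud", "security-pki", "sonstiges",
}
VALID_PRIORITIES = {"low", "medium", "high", "critical"}
VALID_LANGUAGES = {"de", "en"}
# Assumed field list and markers; the real validator may differ.
REQUIRED_FIELDS = ("id", "subject", "body", "category", "priority", "language")
ARTIFACT_MARKERS = ('{"', '"}', "Write a", "Respond as")


def validate_ticket(ticket: dict) -> Optional[dict]:
    """Return the ticket if it passes all checks, otherwise None."""
    for field in REQUIRED_FIELDS:
        value = ticket.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    if ticket["category"] not in VALID_CATEGORIES:
        return None
    if ticket["priority"] not in VALID_PRIORITIES:
        return None
    if ticket["language"] not in VALID_LANGUAGES:
        return None
    if len(ticket["subject"]) < 5 or len(ticket["body"]) < 10:
        return None
    if any(marker in ticket["subject"] for marker in ARTIFACT_MARKERS):
        return None
    return ticket


def generate_until_valid(generate: Callable[[], dict],
                         max_attempts: int = 5) -> Optional[dict]:
    """Call generate() until a ticket passes validation or attempts run out."""
    for _ in range(max_attempts):
        ticket = validate_ticket(generate())
        if ticket is not None:
            return ticket
    return None
```

Capping the number of attempts prevents an endless loop if a prompt systematically produces invalid output.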

Tests with Hypothesis

The tests in testdata/generator/tests/ use property-based testing with Hypothesis. This is more natural for data generators than example-based tests because Hypothesis automatically searches for edge cases:

from hypothesis import given, strategies as st

# validate_ticket and the VALID_* sets come from the generator code (import omitted here).
@given(st.fixed_dictionaries({
    "id": st.just("TKT-0001"),
    "subject": st.text(min_size=5, max_size=120),
    "body": st.text(min_size=10, max_size=1000),
    "category": st.sampled_from(sorted(VALID_CATEGORIES)),
    "priority": st.sampled_from(sorted(VALID_PRIORITIES)),
    "language": st.sampled_from(sorted(VALID_LANGUAGES)),
    "persona": st.just("admin-precise"),
    "generated_by": st.just("local"),
}))
def test_valid_ticket_passes(ticket):
    result = validate_ticket(ticket)
    assert result is not None

Distribution sanity checks round this out: no category below 5 percent in the dataset, no persona above 30 percent. The idempotency check verifies that the same seed produces the same output.
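
A distribution check of this kind might look like the sketch below. The function names and the form of the return value (a list of problem descriptions) are assumptions; the thresholds of 5 and 30 percent are the ones stated above.

```python
from collections import Counter


def share_by(tickets: list, field: str) -> dict:
    """Return the share of each value of `field` across all tickets."""
    counts = Counter(t[field] for t in tickets)
    total = len(tickets)
    return {value: count / total for value, count in counts.items()}


def check_distribution(tickets: list, categories: list,
                       min_category_share: float = 0.05,
                       max_persona_share: float = 0.30) -> list:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    category_shares = share_by(tickets, "category")
    # Iterate over the expected categories so that missing ones are caught too.
    for category in sorted(categories):
        share = category_shares.get(category, 0.0)
        if share < min_category_share:
            problems.append(f"category {category} below minimum: {share:.1%}")
    for persona, share in sorted(share_by(tickets, "persona").items()):
        if share > max_persona_share:
            problems.append(f"persona {persona} above maximum: {share:.1%}")
    return problems
```

In a test, this reduces to `assert check_distribution(tickets, categories) == []`, and a failure message lists exactly which category or persona is out of range.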

Tests run with uv run pytest testdata/generator/tests/.

Three ADRs as Architectural Memory

Three decisions received their own ADR:

ADR 001 — Synthetic instead of real: Rationale for synthetic data, rejected alternatives (public datasets, GitHub Issues). In the repo at docs/adr/001-testdatengenerierung.md.

ADR 002 — DE/EN at 60/40: Why not purely German. Multilingualism is not an extra feature but the condition under which rule-based classification (Article 4) visibly fails. In the repo at docs/adr/002-mehrsprachigkeit.md.

ADR 003 — Hybrid LLM, local plus Claude: Routing criteria and their rationale, cost estimate, rejected alternatives (purely local, purely Claude). In the repo at docs/adr/003-hybrid-llm-generierung.md.

ADRs are short — each under one page. Anyone adapting the generator for their own purposes will find the rationale for decisions that do not appear in the code.

Setting Up and Running the Generator

Four steps to generate the dataset yourself.

1. Clone the repo and check out tag v0.3:

git clone https://codeberg.org/rotecodefraktion/n8n-einstieg.git
cd n8n-einstieg
git checkout v0.3

2. Install uv and fetch dependencies:

# install uv (once)
curl -LsSf https://astral.sh/uv/install.sh | sh

# fetch dependencies
cd testdata/generator
uv sync

3. Start Ollama and pull the model:

ollama serve &
ollama pull qwen2.5:7b

On Apple Silicon, mlx_lm.server is an alternative — the commented section in llm_local.py describes the switch.

4. Run the generator:

uv run python -m generator --seed 42 --n 500

This writes testdata/tickets.parquet and testdata/tickets.jsonl. Runtime is between 15 and 30 minutes depending on the machine. For quicker results, --n 100 is enough for first experiments with the workflow in Article 4.

Without an Anthropic API key, add --no-claude. The roughly 15 percent of tickets that normally go to Claude are then also generated locally. Quality in security-pki and external-reporter is somewhat flatter, but sufficient for getting started.

The Pinned Dataset

testdata/tickets.parquet contains around 500 tickets generated with seed 42. Anyone who does not want to run the generator or does not have an Anthropic API key can work directly with this dataset; all subsequent articles require only the dataset, not the generator. Parquet is typed, compact, and fast to load with pandas or pyarrow. Since n8n cannot read Parquet, the same data is available as tickets.jsonl, one line per ticket.
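
For a quick look at the JSONL file without pandas, a standard-library sketch is enough. The helper names `load_tickets` and `category_shares` are hypothetical; the field name `category` matches the ticket dictionaries used in the tests above.

```python
import json
from collections import Counter
from pathlib import Path


def load_tickets(path) -> list:
    """Read one JSON object per line, skipping blank lines."""
    tickets = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tickets.append(json.loads(line))
    return tickets


def category_shares(tickets: list) -> dict:
    """Return the share of each category, rounded to two decimals."""
    counts = Counter(t["category"] for t in tickets)
    return {cat: round(n / len(tickets), 2) for cat, n in counts.items()}
```

Called as `category_shares(load_tickets("testdata/tickets.jsonl"))` from the repo root, this gives a per-category share that can be compared with the distribution listed for the pinned dataset.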

Distribution in the pinned dataset:

Category | Share
sap-functional | ~25%
sap-basis | ~20%
infrastruktur | ~20%
cloud | ~15%
security-pki | ~10%
sonstiges | ~10%

Language: ~60% German, ~40% English. Priority: ~30% low, ~40% medium, ~20% high, ~10% critical.

The dataset is published under CC0. Resemblance to real support cases is coincidental. To run your own experiments, use a different seed:

cd testdata/generator
uv run python -m generator --seed 99 --n 500 --output ../my-tickets.parquet

The pinned dataset under seed 42 stays unchanged — all subsequent articles test against the same state.

Next Article: The First Workflow

Article 4 takes this dataset and builds the first n8n workflow: rule-based ticket classification with Switch node, Set node, and Code node. No AI model, just keyword matching. That is exactly where it becomes clear why the 60/40 DE/EN split in the dataset was the right decision. Tag v0.4.