Test Data, Because Real Data Is Off the Table
Article 3 · Series: Getting Started with n8n
The demo project classifies support tickets. That requires tickets — realistic ones, in sufficient quantity, with a known category distribution. Real tickets from a production system are out for a public repo: they contain personal data, system configurations, document numbers, sometimes even passwords in plain text. That is not a compliance disclaimer performed out of obligation. It is a genuine content problem. The dataset for this article is on Codeberg, tag v0.3: codeberg.org/rotecodefraktion/n8n-einstieg.
The Options Compared
Four candidates exist when real data is unavailable:
| Option | Problem |
|---|---|
| Public customer support datasets | Almost exclusively English-language B2C telecom data: no SAP context, no German-language tickets from the DACH region |
| GitHub Issues as a source | Legally grey (poster usage rights), wrong domain, already classified |
| Pure LLM generation without structure | Prompting “generate 500 support tickets” produces 500 variations of the same ticket |
| Hybrid: LLM with categories and personas | Variety comes from structure, not from chance |
The fourth option is the one we use. The key is the distinction between variety and randomness. An LLM generating tickets without context reproduces its own bias. Persona and category as explicit inputs enforce diversity — tone, subject area, and urgency are controlled parameters, not a matter of hope.
Categories × Personas
The generator works with six categories and five personas. Each combination produces a different ticket.
Categories:
| ID | Topic | Keywords |
|---|---|---|
| sap-basis | System operations, transport, patches | SM50, STMS, kernel update, ABAP dump |
| sap-functional | FI, MM, SD, HR, processes | Accounting document, purchase order, delivery |
| infrastruktur | Server, network, backup, storage | RAID, NFS, latency, backup failed |
| cloud | Azure, Kubernetes, Terraform | Pod crash, subscription limit, state drift |
| security-pki | Certificates, permissions, CVE | Certificate expired, SU53, penetration test |
| sonstiges | Everything else | (none) |
Personas:
| ID | Profile | Tone |
|---|---|---|
| end-user-frustrated | Frustrated end user | Short sentences, no jargon, exclamation marks |
| key-user-urgent | Urgent key user | Much context, SAP transactions, time pressure |
| admin-precise | Precise admin | Error codes, calm tone, complete details |
| manager-vague | Vague manager | Little detail, high urgency |
| external-reporter | External reporter | Formal, incomplete information |
Three sample tickets from the dataset to show the interaction:
sap-basis / de / end-user-frustrated:
“SAP login not possible after password reset. After the reset by IT I can no longer access the system. Error message: SM50 reports connection failure. This is completely blocking my work!”
sap-basis / de / key-user-urgent:
“Transport request failing in STMS. Transport DEVK900123 cannot be imported. Error in SM37: job RDDIMPDP terminates with dump. Patch level difference between DEV and QAS could be the cause.”
sap-basis / en / admin-precise:
“Kernel update causing SM50 work process restarts. After patch level upgrade to 7.93, work processes restart every 30 minutes. RFC connections drop simultaneously.”
Same category, three different tickets. The persona determines how someone writes, not what the problem is.
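The grid of categories and personas can be sketched in a few lines. This is an illustration under assumptions, not the generator's actual code: the function name `draw_combinations` is hypothetical, the category weights are borrowed from the approximate shares of the pinned dataset, and personas are drawn uniformly, which the article does not state.

```python
import random
from itertools import product

# Category -> approximate share in percent (assumed weights).
CATEGORIES = {
    "sap-functional": 25,
    "sap-basis": 20,
    "infrastruktur": 20,
    "cloud": 15,
    "security-pki": 10,
    "sonstiges": 10,
}
PERSONAS = [
    "end-user-frustrated",
    "key-user-urgent",
    "admin-precise",
    "manager-vague",
    "external-reporter",
]

# Every category can meet every persona: 6 x 5 = 30 combinations.
GRID = list(product(CATEGORIES, PERSONAS))


def draw_combinations(n: int, seed: int) -> list[tuple[str, str]]:
    """Draw n (category, persona) pairs reproducibly from a seed."""
    rng = random.Random(seed)  # local RNG, independent of global state
    cats = rng.choices(list(CATEGORIES), weights=list(CATEGORIES.values()), k=n)
    personas = rng.choices(PERSONAS, k=n)
    return list(zip(cats, personas))
```

Because the random generator is created from the seed inside the function, two calls with the same seed return the same sequence of pairs.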
Local Model for the Bulk
About 85 percent of tickets are generated with a local model via Ollama. This saves API costs, creates no cloud dependency, and runs on Apple Silicon with Qwen 2.5 7B at around 30 tickets per minute.
The adapter in llm_local.py sends the category, the persona's system prompt, and a selection of keywords to the Ollama REST API. A small routing function decides which combinations bypass the local model:
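def is_hard_case(category: str, persona: str) -> bool:
    """Return True if the ticket should go to Claude instead of the local model."""
    hard_categories = {"security-pki"}
    hard_personas = {"external-reporter"}
    # One matching criterion is enough to route the ticket to Claude.
    return category in hard_categories or persona in hard_personas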
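The request to Ollama itself could look roughly like the following. This is a minimal sketch, not the contents of llm_local.py: the function names `build_payload` and `generate` are hypothetical, and the temperature of 0.7 and the timeout are assumptions. The endpoint, the `system`, `prompt`, `stream`, and `options` fields, and the `response` key in the reply belong to Ollama's native generate API.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(system: str, prompt: str, seed: int,
                  model: str = "qwen2.5:7b") -> bytes:
    """Build the JSON body for a non-streaming /api/generate call."""
    return json.dumps({
        "model": model,
        "system": system,   # persona instructions
        "prompt": prompt,   # category and keyword sample
        "stream": False,    # one JSON object instead of a chunk stream
        "options": {"temperature": 0.7, "seed": seed},
    }).encode()


def generate(system: str, prompt: str, seed: int) -> str:
    """Send the request to a running Ollama instance and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(system, prompt, seed),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the payload construction separate from the network call makes the former testable without a running Ollama instance.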
Anyone without Ollama installed can start the generator with --no-local. Claude then handles all cases, or one simply uses the pinned dataset without regeneration.
For Apple Silicon users, MLX is an alternative to Ollama. llm_local.py contains a commented-out section that calls mlx_lm.generate(). Enabling it is enough; the rest of the generator stays unchanged.
Hummingbird Gateway as a Unified Backend
The Hummingbird gateway from the Hummingbird series on rotecodefraktion.de is a Swift-based LLM proxy that runs on Apple Silicon and exposes both the Anthropic format (/v1/messages) and the OpenAI format (/v1/chat/completions). Anyone running the gateway can operate the entire generator without direct Ollama or Anthropic API calls.
Switching llm_local.py to the gateway:
Instead of http://localhost:11434/api/generate (Ollama native API), use the gateway’s OpenAI-compatible endpoint:
import json
import urllib.request

GATEWAY_BASE = "http://localhost:8080"
DEFAULT_MODEL = "mlx-community/Qwen2.5-7B-Instruct-4bit"

# POST /v1/chat/completions instead of /api/generate.
# `system`, `prompt`, and `ticket_id` come from the surrounding adapter code.
payload = json.dumps({
    "model": DEFAULT_MODEL,
    "messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ],
    "temperature": 0.7,
    "seed": ticket_id,
}).encode()
req = urllib.request.Request(
    f"{GATEWAY_BASE}/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer local"},
)
Switching llm_claude.py to the gateway:
The Anthropic Python SDK reads ANTHROPIC_BASE_URL from the environment. Point the variable at the gateway and the “Claude” calls also go through the local model:
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=local # any string, gateway checks its own token
The gateway handles the format translation: it receives the Anthropic request, converts it to OpenAI format, and forwards it to mlx_lm.server or Ollama.
Result: One local endpoint for all LLM calls, no cloud dependency, logging and rate limiting in the gateway, no Anthropic API key required. The generator behaves identically from the outside.
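To make the format translation concrete, here is a rough Python sketch of mapping an Anthropic-style request body to an OpenAI-style one. It is purely illustrative: the gateway itself is written in Swift, the function names `anthropic_to_openai` and `_as_text` are hypothetical, and a real proxy would also have to handle model name mapping, streaming, and non-text content blocks.

```python
def anthropic_to_openai(request: dict) -> dict:
    """Map an Anthropic /v1/messages body to an OpenAI chat completions body."""
    messages = []
    system = request.get("system")
    if system:
        # Anthropic keeps the system prompt at the top level;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": _as_text(system)})
    for msg in request["messages"]:
        messages.append({"role": msg["role"], "content": _as_text(msg["content"])})
    out = {"model": request["model"], "messages": messages}
    # Copy sampling parameters that carry the same name in both formats.
    for key in ("max_tokens", "temperature"):
        if key in request:
            out[key] = request[key]
    return out


def _as_text(content) -> str:
    """Flatten Anthropic content (string or list of text blocks) to a string."""
    if isinstance(content, str):
        return content
    return "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )
```

The main structural difference the sketch covers is the position of the system prompt and the block-based message content on the Anthropic side.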
Claude for the Hard Cases
Two routing criteria send a ticket to Claude instead of the local model:
security-pki: Certificate terminology, CVE numbers, SU53 error patterns — 7B models frequently produce plausible-sounding but technically incorrect formulations here. Claude delivers noticeably denser texts in this category.
external-reporter: Formal ambiguity and deliberately incomplete information are hard for small models to imitate. The ticket should read as if written by someone who does not know what an SAP system is, yet still strikes a formal tone.
The llm_claude.py adapter requests structured JSON output:
prompt = (
    f"Write a support ticket (subject + 2-4 sentences) about {cat['label_en']}. "
    f"Priority: {priority}. Use: {', '.join(kw_sample)}. "
    f"Respond as JSON: {{\"subject\": \"...\", \"body\": \"...\"}}"
)
This makes the Claude output directly parseable without relying on regex. About 15 percent of tickets are generated this way. With 500 tickets, that is roughly 75 Claude calls, which costs under one dollar with claude-sonnet-4-6.
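Parsing such a reply needs only the standard library. The following is a minimal sketch, assuming the model returns a bare JSON object with `subject` and `body`; the function name `parse_ticket_json` is hypothetical and not taken from llm_claude.py.

```python
import json
from typing import Optional


def parse_ticket_json(raw: str) -> Optional[dict]:
    """Return {'subject', 'body'} from a model reply, or None if unusable."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None  # reply was not valid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    subject, body = data.get("subject"), data.get("body")
    if not isinstance(subject, str) or not isinstance(body, str):
        return None  # a required key is missing or has the wrong type
    return {"subject": subject.strip(), "body": body.strip()}
```

Returning `None` instead of raising keeps the caller simple: an unusable reply can be treated like any other ticket that needs a replacement.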
Validator as Quality Layer
Not every generated ticket is usable. The validator in validator.py checks:
- Required fields present and non-empty
- Category, priority, language within allowed value ranges
- Subject at least 5 characters, body at least 10 characters
- No generation artifacts in the subject: JSON fragments, prompt phrases such as “Write a” or “Respond as”
A ticket that does not pass the validator is discarded and the generator produces a replacement. In practice, this affects fewer than three percent of local outputs and under one percent of Claude outputs.
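The four checks listed above could be expressed as follows. This is a sketch under assumptions, not the contents of validator.py: the list of required fields, the priority and language values, and the artifact markers are inferred from the article, and returning the ticket on success is a guess based on the test further down comparing the result against `None`.

```python
from typing import Optional

VALID_CATEGORIES = {"sap-basis", "sap-functional", "infrastruktur",
                    "cloud", "security-pki", "sonstiges"}
VALID_PRIORITIES = {"low", "medium", "high", "critical"}
VALID_LANGUAGES = {"de", "en"}
REQUIRED_FIELDS = ("id", "subject", "body", "category", "priority", "language")
ARTIFACT_MARKERS = ('{"', '"}', "Write a", "Respond as")  # assumed markers


def validate_ticket(ticket: dict) -> Optional[dict]:
    """Return the ticket if it passes all checks, otherwise None."""
    # Required fields present and non-empty.
    if any(not ticket.get(field) for field in REQUIRED_FIELDS):
        return None
    # Enumerated fields within their allowed values.
    if (ticket["category"] not in VALID_CATEGORIES
            or ticket["priority"] not in VALID_PRIORITIES
            or ticket["language"] not in VALID_LANGUAGES):
        return None
    # Minimum lengths for subject and body.
    if len(ticket["subject"]) < 5 or len(ticket["body"]) < 10:
        return None
    # Leftovers from the prompt or the JSON wrapper in the subject.
    if any(marker in ticket["subject"] for marker in ARTIFACT_MARKERS):
        return None
    return ticket
```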
Tests with Hypothesis
The tests in testdata/generator/tests/ use property-based testing with Hypothesis. This is more natural for data generators than example-based tests because Hypothesis automatically searches for edge cases:
from hypothesis import given, strategies as st

# validate_ticket comes from validator.py; the VALID_* sets are assumed
# to be importable alongside it.

@given(st.fixed_dictionaries({
    "id": st.just("TKT-0001"),
    "subject": st.text(min_size=5, max_size=120),
    "body": st.text(min_size=10, max_size=1000),
    "category": st.sampled_from(sorted(VALID_CATEGORIES)),
    "priority": st.sampled_from(sorted(VALID_PRIORITIES)),
    "language": st.sampled_from(sorted(VALID_LANGUAGES)),
    "persona": st.just("admin-precise"),
    "generated_by": st.just("local"),
}))
def test_valid_ticket_passes(ticket):
    result = validate_ticket(ticket)
    assert result is not None
Distribution sanity checks round this out: no category below 5 percent in the dataset, no persona above 30 percent. The idempotency check verifies that the same seed produces the same output.
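The distribution and idempotency checks can be sketched with the standard library alone. The helper names `shares`, `check_distribution`, and `check_idempotent` are hypothetical, and the `generate` argument stands in for whatever callable produces the dataset from a seed; only the thresholds of 5 and 30 percent come from the article.

```python
from collections import Counter
from typing import Callable, Iterable


def shares(tickets: list[dict], field: str) -> dict[str, float]:
    """Relative frequency of each value of `field` across the tickets."""
    counts = Counter(t[field] for t in tickets)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def check_distribution(tickets: list[dict], categories: Iterable[str]) -> None:
    """Fail if a category is underrepresented or a persona dominates."""
    cat = shares(tickets, "category")
    per = shares(tickets, "persona")
    # A category missing from the data counts as a share of zero.
    assert all(cat.get(c, 0.0) >= 0.05 for c in categories), "category below 5 percent"
    assert all(s <= 0.30 for s in per.values()), "persona above 30 percent"


def check_idempotent(generate: Callable[[int], object], seed: int) -> None:
    """Fail if two runs with the same seed differ."""
    assert generate(seed) == generate(seed), "same seed, different output"
```

Passing the full list of expected categories matters: a category that never appears would otherwise slip through a check that looks only at observed values.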
Tests run with uv run pytest testdata/generator/tests/.
Three ADRs as Architectural Memory
Three decisions received their own ADR:
ADR 001 — Synthetic instead of real: Rationale for synthetic data, rejected alternatives (public datasets, GitHub Issues). In the repo at docs/adr/001-testdatengenerierung.md.
ADR 002 — DE/EN at 60/40: Why not purely German. Multilingualism is not an extra feature but the condition under which rule-based classification (Article 4) visibly fails. In the repo at docs/adr/002-mehrsprachigkeit.md.
ADR 003 — Hybrid LLM, local plus Claude: Routing criteria and their rationale, cost estimate, rejected alternatives (purely local, purely Claude). In the repo at docs/adr/003-hybrid-llm-generierung.md.
The ADRs are short, each under one page. Anyone adapting the generator for their own purposes will find the rationale for decisions that do not appear in the code.
Setting Up and Running the Generator
Four steps to generate the dataset yourself.
1. Clone the repo and check out tag v0.3:
git clone https://codeberg.org/rotecodefraktion/n8n-einstieg.git
cd n8n-einstieg
git checkout v0.3
2. Install uv and fetch dependencies:
# install uv (once)
curl -LsSf https://astral.sh/uv/install.sh | sh
# fetch dependencies
cd testdata/generator
uv sync
3. Start Ollama and pull the model:
ollama serve &
ollama pull qwen2.5:7b
On Apple Silicon, mlx_lm.server is an alternative — the commented section in llm_local.py describes the switch.
4. Run the generator:
uv run python -m generator --seed 42 --n 500
This writes testdata/tickets.parquet and testdata/tickets.jsonl. Runtime is between 15 and 30 minutes depending on the machine. For quicker results, --n 100 is enough for first experiments with the workflow in Article 4.
Without an Anthropic API key, add --no-claude. The roughly 15 percent of tickets that normally go to Claude are then also generated locally. Quality in security-pki and external-reporter is somewhat flatter, but sufficient for getting started.
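How the two switches interact with the routing can be summarized in a small function. This is a sketch of the behaviour described in the article, not the generator's code: `pick_backend` is a hypothetical name, and rejecting the combination of both flags is an assumption.

```python
def is_hard_case(category: str, persona: str) -> bool:
    """Routing criteria from the article: one match is enough."""
    return category == "security-pki" or persona == "external-reporter"


def pick_backend(category: str, persona: str,
                 no_local: bool = False, no_claude: bool = False) -> str:
    """Return 'claude' or 'local' for one ticket."""
    if no_local and no_claude:
        # Assumption: with both backends disabled nothing can be generated.
        raise ValueError("at least one backend must stay enabled")
    if no_local:
        return "claude"   # --no-local: Claude handles all cases
    if no_claude:
        return "local"    # --no-claude: hard cases are generated locally too
    return "claude" if is_hard_case(category, persona) else "local"
```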
The Pinned Dataset
testdata/tickets.parquet contains around 500 tickets generated with seed 42. Anyone who does not want to run the generator or has no Anthropic API key can work directly with this dataset: all subsequent articles require only the dataset, not the generator. Parquet is typed, compact, and fast to load with pandas or pyarrow. Since n8n cannot read Parquet, the same data is also available as tickets.jsonl, one ticket per line.
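The JSONL file can be read without any third-party package. A minimal sketch, assuming each line is a JSON object with a `category` field as in the test data shown earlier; the helper names `load_tickets` and `category_counts` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_tickets(path: str) -> list[dict]:
    """Read a JSONL file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def category_counts(tickets: list[dict]) -> Counter:
    """Count tickets per category."""
    return Counter(t["category"] for t in tickets)
```

From the repository root, `category_counts(load_tickets("testdata/tickets.jsonl"))` gives a quick view of how the categories are distributed.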
Distribution in the pinned dataset:
| Category | Share |
|---|---|
| sap-functional | ~25% |
| sap-basis | ~20% |
| infrastruktur | ~20% |
| cloud | ~15% |
| security-pki | ~10% |
| sonstiges | ~10% |
Language: ~60% German, ~40% English. Priority: ~30% low, ~40% medium, ~20% high, ~10% critical.
The dataset is published under CC0. Resemblance to real support cases is coincidental. To run your own experiments, use a different seed:
cd testdata/generator
uv run python -m generator --seed 99 --n 500 --output ../my-tickets.parquet
The pinned dataset under seed 42 stays unchanged — all subsequent articles test against the same state.
Next Article: The First Workflow
Article 4 takes this dataset and builds the first n8n workflow: rule-based ticket classification with a Switch node, a Set node, and a Code node. No AI model, just keyword matching. That is exactly where it becomes visible why 60/40 DE/EN in the dataset was the right decision. Tag v0.4.