Writing Custom Skills — PDF Parser Skill with pytest TDD
Article 3 · Series: Agentic Coding with Claude Code
v0.2 produced the map: docs/architecture.md describes the PDF structure of the Bavarian budget plans down to column level, and since the last commit CLAUDE.md contains four working-principle rules that govern the agent's behavior during implementation. The next step is the first parser code.
For this step the agent needs two things: a clearly formulated task in the form of a custom skill, and a TDD loop that ensures the result is correct — not just that it runs.
What a Custom Skill Is
A custom skill is a Markdown file at .claude/skills/<name>/SKILL.md. The frontmatter holds name and description. The body describes the workflow, tools, and verification criteria.
Claude Code loads the skill automatically when the situation matches. Which situation matches is decided by the description field.
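The full file is small. A minimal skeleton, with placeholder body lines rather than the final parse-haushalt content:
---
name: parse-haushalt
description: <trigger description, see below>
---

## Workflow
1. Run extract_titles() on the target Einzelplan PDF.
2. Write the result as JSON to parser/output/.

## Verification
uv run pytest tests/ -v must be green before every commit.
Red tests mean: no commit.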
The trigger description is the most important line in the entire file.
# Bad — too generic:
description: Parses PDFs. Use when parsing PDFs.
This description says nothing actionable. The agent cannot tell when “parsing PDFs” describes the situation — with every Bash command that mentions a PDF path? When reading a log file? The skill either never triggers or triggers constantly.
# Good — domain-specific with trigger phrases:
description: Extracts Titelübersicht data from a Bavarian budget plan PDF (Einzelplan).
Use this skill when parsing an Einzelplan, extracting titles, producing JSON output,
or running tests against reference values from the Abschluss page.
Triggers: "parse Epl", "extract titles", "run parser", "JSON for Epl".
Three elements: what the skill does, when it applies, concrete phrases as activation patterns. That is sufficient for reliable matching.
Writing the parse-haushalt Skill
skill-creator is the meta-skill for exactly this task: structuring skills, sharpening descriptions, writing evals. We invoked it like this:
Create a skill for byhaushalt: parse-haushalt.
The skill calls extract_titles() on Einzelplan PDFs and outputs JSON.
Domain: Bavarian budget plans, Titelübersicht pages.
Verification: uv run pytest tests/ -v must be green before every commit.
skill-creator produced a first skill file with a generic description:
description: Parses Bavarian budget PDFs. Use when parsing PDFs or extracting data.
The first field test revealed the problem: Claude Code invoked the skill when we asked it to “read through the architecture.md again”. The description matched too broadly. The second skill-creator call:
Improve the trigger description for parse-haushalt.
Problem: the skill triggered on "read the architecture docs" — that is not a parser task.
Goal: trigger only on concrete parser tasks.
Domain keywords: Einzelplan, Titelübersicht, Abschluss, JSON, extract_titles.
List trigger phrases explicitly.
The result is the version above — with domain keywords and an explicit trigger phrase list. Skills are not static documents written once and never touched again. A description that sounds good on first draft often triggers too narrowly or too broadly in practice. skill-creator is the tool for this iteration — both when creating and when sharpening after real-world use.
The verification step belongs in the skill body, not in CLAUDE.md. The reason: CLAUDE.md describes general conventions that apply to all tasks. Verification for the parser is domain-specific — uv run pytest tests/ -v, sum consistency, red tests mean no commit. That belongs in the skill that owns this task.
uv Setup
cd parser/
uv sync --extra dev
pyproject.toml pins the stack decided in ADR 001:
[project]
dependencies = ["pdfplumber>=0.11", "polars>=1.0"]
[project.optional-dependencies]
dev = ["pytest>=8.0", "hypothesis>=6.0"]
A uv sync on a fresh checkout installs exactly these versions from uv.lock — no network beyond the initial fetch, under one second for this stack. That is the reproducibility advantage from ADR 001: uv sync plus uv run as the only prerequisites.
TDD Loop
Before pdf_extract.py exists, the tests get written, and we, not Claude Code, define what they check.
The reason is the control point: the test is the spec. Whoever sets the criteria decides what “done” means. If Claude Code determines criteria and test code on its own, the agent defines its own target — and naturally writes tests that match its own implementation. Those are confirmation tests, not fault-finders.
The test code itself can come from Claude — assert abs(total - Decimal("47379.0")) <= Decimal("1.0") is formulation work. But the criterion behind it — total expenditure 47,379.0 Tsd. €, tolerance ±1 Tsd. €, taken from the Abschluss page (page 22) — comes from us. Claude Code does not know this value from training.
Without the criterion from us, Claude would have fallen back on assert len(titles) > 0 — a test that turns green as soon as anything is returned, regardless of whether the numbers are correct.
The split in practice: we supply the success criteria and reference values, Claude writes the test code against them, Claude implements until the tests are green.
Six tests, ordered by abstraction level:
from decimal import Decimal

from parser.pdf_extract import extract_titles

# EPL11 and the epl11_titles fixture: see the scaffolding sketch below.

def test_extract_titles_nonempty(epl11_titles):
    assert len(epl11_titles) > 0

def test_every_titel_has_required_fields(epl11_titles):
    required = {"titel_nr", "fkz", "kap_nr", "epl_nr", "zweckbestimmung", "soll_2026_tsd"}
    for t in epl11_titles:
        missing = required - set(vars(t).keys())
        assert not missing

def test_all_titles_belong_to_epl11(epl11_titles):
    wrong = [t for t in epl11_titles if t.epl_nr != "11"]
    assert not wrong

def test_kapitel_nummern_sind_subset_der_bekannten(epl11_titles):
    known_kap = {"11 01", "11 02", "11 04"}
    found_kap = {t.kap_nr for t in epl11_titles}
    assert not (found_kap - known_kap)

def test_gesamtausgaben_epl11_2026():
    titles = extract_titles(EPL11)
    # Expenditure titles: titel_nr does not start with "1"
    ausgaben = [t for t in titles
                if t.soll_2026_tsd is not None and not t.titel_nr.startswith("1")]
    total = sum(t.soll_2026_tsd for t in ausgaben)
    assert abs(total - Decimal("47379.0")) <= Decimal("1.0")

def test_gesamteinnahmen_epl11_2026():
    titles = extract_titles(EPL11)
    # Revenue titles: titel_nr starts with "1"
    einnahmen = [t for t in titles
                 if t.soll_2026_tsd is not None and t.titel_nr.startswith("1")]
    total = sum(t.soll_2026_tsd for t in einnahmen)
    assert abs(total - Decimal("11.9")) <= Decimal("1.0")
The reference values for the sum tests come from the Abschluss page of Epl11.pdf (page 22). Architecture.md has this page mapped. We read off: Gesamtausgaben 2026 = 47,379.0 Tsd. €, Gesamteinnahmen 2026 = 11.9 Tsd. €.
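The tests reference an epl11_titles fixture and an EPL11 path constant. A minimal sketch of that scaffolding, placed in the test module or a conftest.py; the PDF path is an assumption about the repo layout, not taken from v0.3:
from pathlib import Path

import pytest

from parser.pdf_extract import extract_titles

# Path assumed for illustration; adjust to wherever Epl11.pdf lives.
EPL11 = Path(__file__).resolve().parent.parent / "data" / "Epl11.pdf"

@pytest.fixture(scope="module")
def epl11_titles():
    # Parse once per test module; the PDF parse is the expensive step.
    return extract_titles(EPL11)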
Step by Step
Step 1 — Have the tests written. Give Claude Code this prompt inside the parser/ directory:
Write parser/tests/test_pdf_extract.py with 6 tests for extract_titles().
Criteria:
- Return value is not empty
- Each title has: titel_nr, fkz, kap_nr, epl_nr, zweckbestimmung, soll_2026_tsd
- epl_nr is always "11"
- Only chapters 11 01, 11 02, 11 04
- Sum of expenditure titles (titel_nr does not start with "1") ≈ 47379.0 Tsd. €, tolerance ±1.0
- Sum of revenue titles (titel_nr starts with "1") ≈ 11.9 Tsd. €, tolerance ±1.0
No implementation code. Tests only. The function is called
extract_titles(pdf_path: Path) -> list[Titel].
Step 2 — Run the tests (must fail):
cd parser/
uv sync --extra dev
uv run pytest tests/ -v
Expected: ImportError: No module named 'parser.pdf_extract' — correct, the module does not exist yet. This is the TDD starting state: red test, clear target.
Step 3 — Implementation. Give Claude Code this prompt after the red test:
The 6 tests are in parser/tests/test_pdf_extract.py — all fail with ImportError.
Implement parser/src/parser/pdf_extract.py:
- Dataclass Titel with the fields from docs/architecture.md (target model section)
- Function extract_titles(pdf_path: Path) -> list[Titel]
- Parse only Titelübersicht pages — skip annotations, Abschluss, foreword
- Important: column 6 holds stacked A/B/C values (Budget 2025, Actual 2024, Actual 2023)
  — these must not be read as soll_2026
- Scope: Epl11 (30 pages) only. No code beyond this scope.
- All 6 tests must be green before committing — no commit before.
Step 4 — Run the tests (must be green):
uv run pytest tests/ -v
Expected: 6 passed in 0.22s
The reference to docs/architecture.md in the implementation prompt matters. The agent already knows what column 6 means — it is in the repository. Without the mapped structure from Article 2, the prompt would be longer and more error-prone.
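For orientation, a hypothetical sketch of the module shape the prompt asks for. The page-marker check is a stand-in assumption; the real marker recognition follows docs/architecture.md:
from dataclasses import dataclass
from decimal import Decimal
from pathlib import Path

import pdfplumber

@dataclass
class Titel:
    titel_nr: str
    fkz: str
    kap_nr: str
    epl_nr: str
    zweckbestimmung: str
    soll_2026_tsd: Decimal | None

def extract_titles(pdf_path: Path) -> list[Titel]:
    titles: list[Titel] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            # Skip annotations, Abschluss pages, and forewords. The
            # "Titelübersicht" substring is a simplified stand-in for
            # the marker recognition documented in docs/architecture.md.
            if "Titelübersicht" not in text:
                continue
            # Row parsing goes here. Column 6 holds the stacked A/B/C
            # comparison values and must not land in soll_2026_tsd.
    return titles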
How Tests Prevent Hallucinations
Tests 5 and 6 are the critical ones. The sum test forces the parser to read actual values from the PDF — not estimated, not hallucinated.
The Titelübersicht block in Epl11.pdf has six columns. Column 6 holds three stacked comparison values: A = Budget 2025, B = Actual 2024, C = Actual 2023 — vertically, one line per value, without their own title number. Architecture.md documents this pattern. A naive implementation that collects all decimal numbers in a text line pulls these column-6 values in as soll_2026 and produces wrong sums.
Without the sum test: the parser runs through, no exception, wrong result. With the test: immediate detection.
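A compressed illustration of that failure mode. The line layout and values are simplified and hypothetical, not a verbatim excerpt from Epl11.pdf:
import re
from decimal import Decimal

# One simplified Titelübersicht row: the real soll_2026 value followed
# by a stacked column-6 comparison value on the same text line.
line = "422 01-7  011  Bezüge der Beamten  1.234,5  1.180,2"

def to_decimal(s: str) -> Decimal:
    # German number format: "." groups thousands, "," is the decimal mark.
    return Decimal(s.replace(".", "").replace(",", "."))

# Naive: collect every decimal number on the line. The column-6 value
# 1.180,2 is pulled in alongside the real soll_2026.
numbers = [to_decimal(m) for m in re.findall(r"\d[\d.]*,\d", line)]
print(numbers)  # [Decimal('1234.5'), Decimal('1180.2')], one value too many

# Any sum built this way is inflated by the comparison values. The
# tolerance check against 47379.0 fails immediately; a bare
# len(titles) > 0 test would stay green.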
That is the difference between verification-before-completion as discipline and verification as an optional step. The working-principle rule “goal-driven execution” from CLAUDE.md makes it binding: success criteria before implementation, not after.
First Raw JSON
parser/output/epl11.json — 81 titles, 26 KB:
{
  "epl_nr": "11",
  "epl_name": "Bayerischer Oberster Rechnungshof",
  "haushaltsjahr": "2026",
  "titel_count": 81,
  "titel": [
    {
      "titel_nr": "111 01-0",
      "fkz": "011",
      "kap_nr": "11 01",
      "epl_nr": "11",
      "zweckbestimmung": "Gebühren, Beiträge, tarifliche und gebührenartige Entgelte",
      "soll_2026_tsd": null
    },
    {
      "titel_nr": "119 49-6",
      "fkz": "011",
      "kap_nr": "11 01",
      "epl_nr": "11",
      "zweckbestimmung": "Vermischte Einnahmen",
      "soll_2026_tsd": 5.0
    }
  ]
}
soll_2026_tsd: null represents --- in the PDF — no budget allocation for this title in fiscal year 2026. What is still missing: historical A/B/C values, annotation text, commitment appropriations. And: only Epl11 out of 16 Einzelpläne. That comes in Article 4.
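A quick plausibility check against the export, as a usage sketch; it re-derives the expenditure total the tests already verify:
import json
from decimal import Decimal
from pathlib import Path

data = json.loads(Path("parser/output/epl11.json").read_text(encoding="utf-8"))

# Re-derive total expenditure: skip titles without a 2026 allocation
# (null in the JSON) and revenue titles (titel_nr starts with "1").
total = sum(
    (Decimal(str(t["soll_2026_tsd"])) for t in data["titel"]
     if t["soll_2026_tsd"] is not None and not t["titel_nr"].startswith("1")),
    Decimal("0"),
)
assert abs(total - Decimal("47379.0")) <= Decimal("1.0")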
CLAUDE.md Working Principles — Concrete Effect
The four rules were directly visible in the parser commit.
Minimum principle: The first draft wanted to process all 16 Einzelpläne at once. That was deferred. Epl11 (30 pages, smallest Einzelplan) as the first target — 81 titles, 6 tests green, clear foundation for extension.
Before coding: Assumptions made explicit: scope is Epl11, target is Titelübersicht pages, verification is sum consistency against the Abschluss page. That was established before the first implementation line.
Surgical changes: The parser actively skips annotation pages, Abschluss pages, and forewords — not silently, but explicitly through marker recognition. No code touches what the skill does not describe.
Goal-driven execution: The tests define the target. Only when 47379.0 ± 1.0 matches is the task considered complete — not when the code throws no exception.
State at the End of This Article
git clone https://codeberg.org/rotecodefraktion/byhaushalt.git
cd byhaushalt
git checkout v0.3
cd parser
uv sync --extra dev
uv run pytest tests/ -v
Full state at byhaushalt @ v0.3.
Tag v0.3 marks: first working parser, 6 green tests, 81 extracted titles from Epl11, custom skill with tested trigger description.
Where We Go Next
The next article covers Slash Commands — reusable workflows the agent executes on demand. For byhaushalt: a /compile-epl command that handles extraction, validation, and JSON export for any Einzelplan in a single step. The extraction from v0.3 is extended to all 16 Einzelpläne.
How CLAUDE.md and permissions lay the groundwork is covered in Article 1. The PDF structure analysis with Plan Mode is in Article 2.