Hooks — Tests as a Gate in the Background

Article 9 · Series: Agentic Coding with Claude Code

After Article 8, byhaushalt has three test layers: pytest for the parser, Vitest for the React components, Playwright for end-to-end. There is one format standard per language: Ruff for Python, Prettier for TypeScript. And there is no mechanism whatsoever that enforces any of this automatically. Anyone who edits a line in the heat of a session and forgets to format leaves behind a diff full of whitespace noise. Anyone who leaves a print("DEBUG") in the parser and commits it costs review time. Anyone who ends a session without running pytest does not know whether the change still satisfies the sum constraint. Hooks solve this without requiring either the human or the model to remember.

Hooks, Skills, and Slash Commands — Who Triggers What

Claude Code distinguishes three mechanisms for embedding recurring actions into a project. They overlap on the surface, but who triggers them is different in each case:

Mechanism	Triggered by	When	Can block?
Skill	the model, when a trigger matches its description	during the conversation	no
Slash command	the human, by typing `/<name>`	immediately on invocation	no
Hook	the system, automatically on an event	before/after tool use, at turn end, at session start	yes (PreToolUse, Stop)

A skill like e2e-spec from Article 8 waits for the model to invoke it — we write “generate the E2E test”, the model recognizes the skill description and executes it. A slash command like /check-totals needs the human to type it. A hook needs neither — it fires automatically when the defined event occurs. That is precisely what makes it strong as a quality gate: whatever is bound to a hook runs even when nobody thought of it.

Hook Events at a Glance

Claude Code defines more than twenty events to which a hook can be attached. Three of them are central for quality gates:

Event	When	Typical use
`PreToolUse`	before a tool runs	block or modify the action
`PostToolUse`	after a tool succeeds	process the result, format
`Stop`	when the model ends its turn	final check, tests

There are more events beyond these: SessionStart loads context at the beginning, UserPromptSubmit validates user input, SubagentStop reacts when a subagent finishes. The documentation lists them; for v0.9 we only use the three central ones.

A hook receives the event payload as JSON on stdin. For PreToolUse and PostToolUse it contains, among other fields, tool_name and tool_input: for Edit the file path, for Bash the command. The hook decides based on this data what to do: log, format, block. The exit code controls the outcome: exit 0 means success and lets the action continue, exit 2 means “block” and returns stderr as the reason. For finer control the hook can exit 0 and print JSON to stdout, for instance permissionDecision: "deny" with a detailed reason, in which case the JSON decides the outcome.
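The contract can be sketched as a small function, assuming jq is installed. The name guard_payload and the protected-file rule are hypothetical and exist only to exercise both exit codes; tool_name and tool_input.file_path are the payload fields described above.

```shell
#!/usr/bin/env bash
# Hypothetical PreToolUse-style check. Reads the event payload from stdin,
# returns 0 to let the tool call continue, 2 to block it with a reason on stderr.
guard_payload() {
  local payload tool file
  payload=$(cat)
  tool=$(jq -r '.tool_name // empty' <<< "$payload")
  file=$(jq -r '.tool_input.file_path // empty' <<< "$payload")

  # Illustrative rule only: refuse edits to a made-up protected file.
  if [[ "$tool" == "Edit" && "$file" == */protected.lock ]]; then
    echo "blocked: $file must not be edited by the agent" >&2
    return 2
  fi
  return 0
}
```

Wired up as a real hook, the script would end with guard_payload; exit $? so that the function's return value becomes the process exit code.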

Plan File for Article 9

Task 1 (parallel): PostToolUse format hook.
  Reads tool_input.file_path. Extension → ruff or prettier.
  Exit always 0 (formatting is best-effort).

Task 2 (parallel): PreToolUse commit guard.
  Reads tool_input.command. If `git commit`: scan staged diff
  for print(, console.log(, breakpoint(), debugger;.
  On match: JSON with permissionDecision: deny.

Task 3 (after Task 1 + 2): Stop hook with smart guard.
  Reads git diff --name-only. Only test layers that changed.
  Playwright deliberately excluded.

Task 1: Formatting After Every Edit

The first hook is the simplest and the one with the highest frequency. After every Edit or Write a shell script runs, takes the edited file, and passes it to the matching formatter. Python files go to Ruff, everything under web/ goes to Prettier:

#!/usr/bin/env bash
set -uo pipefail
INPUT=$(cat)
FILE=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
[[ -z "$FILE" ]] && exit 0

PROJECT_DIR="${CLAUDE_PROJECT_DIR:-$(pwd)}"
REL="${FILE#"$PROJECT_DIR"/}"

case "$REL" in
  *.py)
    (cd "$PROJECT_DIR" && uvx ruff format "$REL") >/dev/null 2>&1 || true ;;
  web/*.ts|web/*.tsx|web/*.json|web/*.md|web/*.yml|web/*.yaml|web/*.css)
    SUB="${REL#web/}"
    (cd "$PROJECT_DIR/web" && npx --yes prettier@3 --write "$SUB") >/dev/null 2>&1 || true ;;
esac

exit 0

The choice of uvx over a local Ruff installation is deliberate. uvx downloads the tool on demand and caches it in the uv cache; Ruff does not need to be in the parser’s dev dependencies. The same goes for npx --yes prettier@3 for the frontend files: Prettier is loaded on demand if it is not in web/node_modules/. The trade-off: the first edit after a reset of the uv or npm cache takes a few seconds longer because of the download. After that the hook runs in under 100 ms.

What matters is the || true wrapping around both formatter calls. If Ruff or Prettier fails to run for any reason (no internet connection on the first call, a broken configuration), that must not disturb the edit. Formatting is best-effort, not a quality gate. PostToolUse runs after the edit has already been applied, so exit 2 cannot undo it; it only reports the formatter failure back to the model as an error, which quickly leads to frustration over the course of a session.
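The routing decision can be checked in isolation without running any formatter. The following is a minimal sketch that mirrors the case statement of the hook; the helper name pick_formatter and the sample paths are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the case statement of the format hook:
# prints which formatter a project-relative path would be routed to.
pick_formatter() {
  case "$1" in
    *.py) echo "ruff" ;;
    web/*.ts|web/*.tsx|web/*.json|web/*.md|web/*.yml|web/*.yaml|web/*.css)
      echo "prettier" ;;
    *) echo "none" ;;
  esac
}

# Print the routing decision for a few sample paths.
for path in parser/normalize.py web/src/App.tsx web/package.json README.md web/src/logo.svg; do
  printf '%-22s -> %s\n' "$path" "$(pick_formatter "$path")"
done
```

In a case pattern the * also matches slashes, so web/*.tsx covers nested files such as web/src/App.tsx. Because the *.py branch comes first, a Python file under web/ would still go to Ruff, and a Markdown file outside web/ is left alone.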

The registration in .claude/settings.json makes the matcher explicit:

"PostToolUse": [
  {
    "matcher": "Edit|Write",
    "hooks": [
      {
        "type": "command",
        "command": "${CLAUDE_PROJECT_DIR}/.claude/hooks/post-edit-format.sh",
        "timeout": 30
      }
    ]
  }
]

The matcher is a regex on the tool name — Edit|Write reacts to both, but ignores Read, Grep, Bash, and everything else. If you omit the matcher, the hook fires on every tool. That is rarely what you want.

Task 2: Commit Guard Against Debug Leftovers

The second hook attaches to PreToolUse with matcher Bash. We do not want to check on every shell command, though, only on git commit. The filter happens inside the script:

#!/usr/bin/env bash
set -uo pipefail
INPUT=$(cat)
CMD=$(echo "$INPUT" | jq -r '.tool_input.command // empty')

if ! echo "$CMD" | grep -qE '^[[:space:]]*git[[:space:]]+commit(\b|$)'; then
  exit 0
fi

cd "${CLAUDE_PROJECT_DIR:-$(pwd)}" || exit 0

PY_PATTERNS='(\bprint\(|\bbreakpoint\(\)|\bpdb\.set_trace|\bipdb)'
JS_PATTERNS='(\bconsole\.(log|debug)\(|\bdebugger[[:space:]]*;|\.only\()'

DIFF=$(git diff --cached --unified=0 2>/dev/null)
[[ -z "$DIFF" ]] && exit 0

HITS=""
CURRENT_FILE=""
while IFS= read -r line; do
  if [[ "$line" =~ ^\+\+\+\ b/(.+)$ ]]; then
    CURRENT_FILE="${BASH_REMATCH[1]}"
    continue
  fi
  [[ "$line" =~ ^\+[^+] ]] || continue
  content="${line:1}"
  case "$CURRENT_FILE" in
    *.py)
      echo "$content" | grep -qE "$PY_PATTERNS" && \
        HITS="${HITS}${CURRENT_FILE}: ${content}"$'\n' ;;
    *.ts|*.tsx|*.js|*.jsx)
      echo "$content" | grep -qE "$JS_PATTERNS" && \
        HITS="${HITS}${CURRENT_FILE}: ${content}"$'\n' ;;
  esac
done <<< "$DIFF"

if [[ -n "$HITS" ]]; then
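  # printf expands \n here; inside plain double quotes it would stay a literal backslash-n
  REASON=$(printf 'Commit blocked. Debug leftovers in staged diff:\n%s\nRemove them or unstage the file.' "$HITS")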
  jq -n --arg r "$REASON" '{
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "deny",
      permissionDecisionReason: $r
    }
  }'
fi

exit 0

Two details deserve attention. First, the regex for command detection: ^[[:space:]]*git[[:space:]]+commit(\b|$). It matches git commit, git commit -m, git commit -a, but not git config commit.gpgsign true or git committed-files-script.sh. The trailing \b matters: without it, the pattern would also match git committers-list.
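The pattern is easy to probe outside the hook. A minimal sketch, assuming GNU grep (\b is an extension in -E mode); is_git_commit wraps the pattern from the script, while is_git_commit_strict is a hypothetical variant that is not part of the repository.

```shell
#!/usr/bin/env bash
# Same pattern as in the commit guard.
is_git_commit() {
  grep -qE '^[[:space:]]*git[[:space:]]+commit(\b|$)' <<< "$1"
}

# Hypothetical stricter variant: require whitespace or end of string after "commit".
is_git_commit_strict() {
  grep -qE '^[[:space:]]*git[[:space:]]+commit([[:space:]]|$)' <<< "$1"
}

# Compare both patterns on a handful of sample commands.
for cmd in 'git commit -m "fix"' 'git config commit.gpgsign true' \
           'git committers-list' 'git commit-tree HEAD' \
           'cd parser && git commit -am wip'; do
  if is_git_commit "$cmd"; then a=match; else a=skip; fi
  if is_git_commit_strict "$cmd"; then b=match; else b=skip; fi
  printf '%-34s hook=%-5s strict=%s\n' "$cmd" "$a" "$b"
done
```

Two boundary cases show up in this probe. A word boundary also sits between commit and a hyphen, so the original pattern treats plumbing commands such as git commit-tree as a commit; the stricter variant does not. And because of the ^ anchor, a chained command like cd parser && git commit passes both patterns unchecked. Whether either case matters depends on how commands are phrased in practice.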

Second, the diff parsing: git diff --cached --unified=0 returns only the headers and the changed lines, no context lines. We are only interested in added lines (^\+[^+] — plus at the beginning, but not +++ from the header). That way we catch debug leftovers that are new in this commit, but deliberately ignore debug code that was already in the repo and was not touched. Anyone who wants to clean up legacy needs a different mechanism — the commit guard is built as a filter for freshly added debug leftovers.
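To see the added-lines filter at work, the loop can be run against a hand-written diff. This is a sketch under stated assumptions: scan_added_lines is a hypothetical helper reduced to the Python patterns, and the sample diff is invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan a unified diff on stdin and print "file: line"
# for every added Python line that matches the debug pattern.
scan_added_lines() {
  local pattern='(\bprint\(|\bbreakpoint\(\))'
  local current_file="" line content
  while IFS= read -r line; do
    # Track the file the following hunks belong to.
    if [[ "$line" =~ ^\+\+\+\ b/(.+)$ ]]; then
      current_file="${BASH_REMATCH[1]}"
      continue
    fi
    # Keep only added lines; removed lines and hunk headers are skipped.
    [[ "$line" =~ ^\+[^+] ]] || continue
    content="${line:1}"
    if [[ "$current_file" == *.py ]] && grep -qE "$pattern" <<< "$content"; then
      printf '%s: %s\n' "$current_file" "$content"
    fi
  done
}

# Invented sample: one new print, one harmless addition, one removed print.
scan_added_lines <<'EOF'
--- a/parser/normalize.py
+++ b/parser/normalize.py
@@ -10,0 +11,2 @@
+    print("DEBUG", row)
+    total += row.amount
@@ -20 +22 @@
-    print("old")
+    return total
EOF
```

Only the newly added print is reported. The removed print("old") line starts with a minus and never reaches the pattern check, which is exactly the "new in this commit" behavior described above.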

Returning JSON instead of a plain exit 2 has the advantage that the reason reaches the model in structured form. Claude sees in the chat exactly which file and which line was blocked, and can go back and remove the print precisely, instead of repeating the commit call and being puzzled.
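The registration for this hook is not shown above. By analogy with Task 1 it could look like the following fragment, assuming the script name pre-commit-guard.sh from the repository listing and the same hooks directory. The fragment sets no timeout, which is a simplification of this sketch rather than a recommendation.

```json
"PreToolUse": [
  {
    "matcher": "Bash",
    "hooks": [
      {
        "type": "command",
        "command": "${CLAUDE_PROJECT_DIR}/.claude/hooks/pre-commit-guard.sh"
      }
    ]
  }
]
```

The matcher only narrows the hook to the Bash tool; the decision whether a command is a commit stays inside the script.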

Task 3: Stop Hook With Smart Guard

The third hook is the trickiest. Stop fires after every response the model finishes — not only at session end, but also after every intermediate response in a longer workflow. In a session with twenty turns the hook runs twenty times. Anyone who launches the full test suite here — pytest plus Vitest plus Playwright — produces ten to twenty minutes of pure waiting time per session. That eats the speed advantage of agentic coding entirely.

Our Stop hook therefore uses a smart guard: it reads git diff --name-only, looks at which layers were touched, and triggers only the matching test suite. A read-only response in which only Grep or Read was called passes through the hook in 50 ms without triggering anything.

#!/usr/bin/env bash
set -uo pipefail
cd "${CLAUDE_PROJECT_DIR:-$(pwd)}" || exit 0

CHANGED=$( {
  git diff --name-only HEAD 2>/dev/null
  git diff --cached --name-only 2>/dev/null
  git status --porcelain 2>/dev/null | awk '{print $NF}'
} | sort -u )

[[ -z "$CHANGED" ]] && exit 0

PARSER_CHANGED=$(echo "$CHANGED" | grep -E '^parser/' || true)
WEB_CHANGED=$(echo "$CHANGED" | grep -E '^web/src/' || true)

FAILS=""

if [[ -n "$PARSER_CHANGED" ]]; then
  OUT=$(cd parser && uv run --extra dev pytest -x -q 2>&1 | tail -3) \
    || FAILS="${FAILS}[stop-hook] pytest failed:\n${OUT}\n"
fi

if [[ -n "$WEB_CHANGED" ]]; then
  OUT=$(cd web && npm run test -- --run --reporter=basic 2>&1 | tail -3) \
    || FAILS="${FAILS}[stop-hook] vitest failed:\n${OUT}\n"
fi

if [[ -n "$FAILS" ]]; then
  printf "%b\n" "$FAILS" >&2
fi

exit 0

Three design decisions are not obvious. First, the layer mapping: parser/ triggers pytest, web/src/ triggers Vitest. Other areas change too (.claude/, plans/, docs/), but they have no tests and therefore do not justify a run. Anyone who broadens the hook risks test runs on every documentation change.
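The mapping itself can be isolated as a pure function and tested without git. A minimal sketch; layers_for is a hypothetical helper, and the two patterns are the ones from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads changed paths (one per line) on stdin and prints
# the test suites the Stop hook would run for them, separated by spaces.
layers_for() {
  local changed suites=""
  changed=$(cat)
  if grep -qE '^parser/' <<< "$changed"; then
    suites+="pytest "
  fi
  if grep -qE '^web/src/' <<< "$changed"; then
    suites+="vitest "
  fi
  # Strip the trailing space; prints an empty line when no suite applies.
  echo "${suites% }"
}

printf '%s\n' parser/normalize.py web/src/App.tsx | layers_for
printf '%s\n' docs/intro.md plans/v0.9-hooks.md | layers_for
```

One consequence of the narrow web/src/ pattern: a change to web/package.json or another file directly under web/ triggers no Vitest run. That fits the goal of a fast loop, but it is worth knowing when a dependency bump breaks a component test.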

Second, the tail -3 at the end of both test outputs. On success that is the pytest/vitest summary (“3 passed in 0.42s”), on failure the test location. Writing full test outputs to stderr would be loud and unusable — three lines give enough context to see what went wrong.

Third, the exit code: even on failed tests the hook returns exit 0. Stop hooks can indeed prevent the model from finishing its turn via exit 2, but that is usually not what you want. A hard-blocking hook risks a worst-case loop: the model wants to end the turn, the hook blocks because a test fails, the model tries a fix, the hook blocks again. The warning variant is gentler: the hook writes the failure summary to stderr and lets the turn end. One caveat: since exit 0 counts as success, the stderr output may surface only in the transcript or in the debug log rather than reach the model directly, so it is worth checking with claude --debug where the warning actually lands. Anyone wanting to interrupt the hook presses Ctrl+C, which cancels the test run, not the session.
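For readers who want a blocking variant anyway, the Stop payload carries a stop_hook_active flag, which the hooks documentation describes as true when the model is already continuing because of a Stop hook. The following sketch uses it to block at most once per chain. It assumes jq and that field; the function stop_decision and its arguments are hypothetical and not part of v0.9.

```shell
#!/usr/bin/env bash
# Hypothetical decision helper for a blocking Stop hook.
# $1: Stop payload as JSON, $2: "yes" if the quick tests failed.
# Returns 2 (block the turn end) only if the model is not already continuing
# because of an earlier block; otherwise it lets the turn end to avoid a loop.
stop_decision() {
  local payload="$1" tests_failed="$2" active
  active=$(jq -r '.stop_hook_active // false' <<< "$payload")
  if [[ "$tests_failed" == "yes" && "$active" != "true" ]]; then
    echo "quick tests failed, please fix before finishing" >&2
    return 2
  fi
  return 0
}
```

The design choice is a single retry: the first failing turn is blocked and the model gets the reason, the follow-up turn ends regardless of the test result. That trades strictness for a guaranteed exit from the loop described above.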

What is deliberately left out is Playwright. Three reasons:

Startup cost. Playwright launches a real Chromium (about five seconds just for the browser spawn) and a Vite dev server (three to eight seconds). With zero actual tests, eight to thirteen seconds of overhead already exist per turn. Over a session of twenty turns that is several minutes of waiting for nothing.

Port conflicts. playwright.config.ts starts webServer on port 5173. If a manual npm run dev is running in parallel, there is a conflict or a reuseExistingServer race. Normal in interactive development mode, broken in the hook context.

Baseline drift. The visual regression screenshots in tests/e2e/screenshots/baseline/ are rewritten on every Playwright run with --update-baseline. A hook that runs after every response produces constantly shifting baselines. That belongs in a controlled environment, not in every turn end.

Playwright therefore stays manual (npm run test:e2e when the human wants it) and moves into CI in Article 10, which runs once per push.

What the Smart Guard Does Not Catch

The git diff check is pragmatic, not complete. Three edge cases it deliberately misses:

Deleted files. If you delete a tested file, git diff --name-only lists its path like any other change, so the hook still triggers because the path lands in CHANGED. The tests may then fail on the missing module. Rare in practice, caught by CI.
History rewrites. A git rebase -i or git commit --amend rewrites the history, but git diff HEAD afterwards may show less than expected. The hook only sees what is in the working tree.
Branch switches without commits. If you switch branches and the new branch has different tests, the hook does not see that — it only knows the current working tree.

For the 95 percent of sessions where you sit on a branch and edit code, the check is enough. For the remaining 5 percent the CI in Article 10 catches it.

Debugging Hooks

When a hook silently fails, nothing shows in the chat. That is exactly the trap: a broken hook produces no error message, it is simply absent. The formatter does not run, the commit guard does not catch, the tests get skipped — and we notice it only when the code surfaces in review.

The path forward is claude --debug. In the debug log every hook call appears with stdin, stdout, stderr, and exit code. A typical debug excerpt from a failed format hook:

[DEBUG] PostToolUse hook fired: post-edit-format.sh
[DEBUG] Hook stdin: {"tool_name":"Edit","tool_input":{"file_path":"/.../normalize.py"}, ...}
[DEBUG] Hook stderr: jq: error: Cannot iterate over null (null)
[DEBUG] Hook exit: 0 (suppressed because exit 0)

The hook here ran with exit 0 — so no visible error — but jq failed internally. The cause: an older Claude Code with a different stdin schema. Without --debug, this bug would have stayed invisible.

A second useful trick: feed the script test JSON directly.

echo '{"tool_input":{"file_path":"/path/to/file.py"}}' | .claude/hooks/post-edit-format.sh
echo "EXIT $?"

What works here also works in the hook. What fails here also fails there — but can be debugged in isolation without needing the session as a reproducer.
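Once there is more than one payload to replay, a tiny harness saves typing. A sketch, assuming the hooks are bash scripts as in this article; run_hook_case is a hypothetical helper that reports both the exit code and the combined output, since the commit guard signals a block through stdout JSON rather than through its exit code.

```shell
#!/usr/bin/env bash
# Hypothetical harness: feed one JSON payload to a hook script on stdin and
# print its exit code together with whatever it wrote to stdout and stderr.
run_hook_case() {
  local script="$1" payload="$2" out rc=0
  out=$(bash "$script" <<< "$payload" 2>&1) || rc=$?
  printf 'exit=%s output=%s\n' "$rc" "$out"
}
```

A call such as run_hook_case .claude/hooks/pre-commit-guard.sh '{"tool_input":{"command":"git status"}}' then shows on one line whether the script stayed silent or produced a decision.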

Status at the End of This Article

git clone https://codeberg.org/rotecodefraktion/byhaushalt.git
cd byhaushalt
git checkout v0.9
ls .claude/hooks/
# post-edit-format.sh
# pre-commit-guard.sh
# stop-quick-tests.sh
cat .claude/settings.json | jq '.hooks | keys'
# ["PostToolUse", "PreToolUse", "Stop"]

Full state at byhaushalt @ v0.9.

v0.9 contains three hook scripts under .claude/hooks/, the expanded .claude/settings.json with three hook registrations, and the plan file plans/v0.9-hooks.md. The test count is unchanged since v0.8: 19 pytest tests passing with 3 documented xfails, 3 Vitest tests passing, 3 Playwright tests passing. What has changed is not the number of tests but how often they run: Vitest and pytest now run automatically on every relevant change, formatting on every edit, and the commit block on every print( leftover.

What Comes Next

Article 10 closes the series: Codeberg Woodpecker as CI, running all three test layers in a clean environment, and a static deploy of the visualization onto a subdomain. The fast loop is done with hooks — the slow loop moves into CI, where Playwright finally finds its place.

How the e2e-spec skill translates Markdown specs into Playwright tests is in Article 8. How the Playwright MCP server is configured is in Article 7.