The local coding agent, put to the test

Article 6 · Series: A Local Coding Agent with apfel

Five articles of building: CLI, serve protocol, Swift client, tool calling, hardened tools. The agent can read, write, and execute. What stays open is the uncomfortable question of how much it actually gets done. This article builds no new feature; it measures. The thesis up front, so the finding is not mistaken for a takedown: the local model is not “worse” than a cloud model, it has a different operating profile. There are tasks where it carries and tasks where it breaks, and the dividing line does not run where you would expect it. A second question is measured separately: what the model claims about its own success, against what a machine determines about that success.

How it was measured

The eval is 15 tasks across five categories, three tasks each: small single-file edits, code explanations, short tool chains, multi-step plans, and tasks under context-window pressure. Each task runs three times, and the result is the majority vote. The decisive point is in the scoring. Success or failure is never judged by the model but by a deterministic check() function per task: a grep for the new function name, a file diff against the expected output, an exit code. Letting the model judge is precisely the mistake a naive eval makes, and the one this article dissects: ask the model whether it worked, and you measure the self-report and call it a result.
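The three kinds of check and the majority vote can be sketched in a few lines. This is a minimal illustration, not the harness from the demo repo: the function names are hypothetical, and the actual check() of each task may look different.

```python
import subprocess
from pathlib import Path


def check_contains(path: Path, needle: str) -> bool:
    """Grep-style check: does the file contain the expected name?"""
    return needle in path.read_text(encoding="utf-8")


def check_same_content(actual: Path, expected: Path) -> bool:
    """Diff-style check: is the file byte-identical to the expected output?"""
    return actual.read_bytes() == expected.read_bytes()


def check_exit_code(cmd: list[str]) -> bool:
    """Exit-code check: did the command finish with status 0?"""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def majority_vote(results: list[bool]) -> bool:
    """A task passes if more than half of its runs passed their check."""
    return sum(results) * 2 > len(results)
```

None of these functions ever sees the model's reply. They only look at files and processes, which is what makes the verdict repeatable.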

The task canon carries an implicit choice of which coding tasks count as “typical”. That choice is a bias, and we name it as one. All numbers are our own measurements against a local apfel serve (apfel 1.5.1, macOS 26.3, Foundation Model snapshot 2026-06-08); the full harness lives in the demo repo.

Where the model carries

The model carries reliably on explanations. Explain a short function in one sentence, name an off-by-one error, compare two functions: all three tasks succeed in the majority vote (3 of 3; own measurement v0.6). That fits the nature of the task. Explaining means producing text, and producing text is the core competence of a language model. It need not operate a tool, hit an argument schema, or change any state.

In the category of short tool chains the picture is mixed. Finding a file, reading it and extracting a value succeeds; listing, reading and summarizing, as well as read-edit-write-back, fail in the majority vote (1 of 3; own measurement v0.6). The model stumbles as soon as the chain contains a writing step; the reading is not where it breaks.

Where it breaks

The first surprising finding sits where you would have expected success: at the small single-file edits. Renaming a function, adding a docstring, turning a signature async: all three fail in the majority vote (0 of 3; own measurement v0.6). The model can formulate the correct code as text, but it does not reliably get it through the writing tool. Picking the tool, hitting the argument keys, reproducing the full new content, and escaping the JSON cleanly: the small model is not up to doing all of that at once. How much of this can be recovered with the right agent design is the subject of the next article.
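A small sketch makes the escaping part of that load concrete. The tool name write_file and the argument keys path and content are assumptions for illustration, not the agent's actual schema. The broken variant is constructed by hand to show one plausible failure, not taken from a recorded run.

```python
import json

# New file content with quotes and newlines, as a one-line docstring edit would produce.
new_content = 'def fetch_user(user_id):\n    """Return the "active" user."""\n    return db.get(user_id)\n'

# What a serializer produces: quotes and newlines inside the content are escaped.
valid_call = json.dumps({
    "name": "write_file",  # hypothetical tool name
    "arguments": {"path": "users.py", "content": new_content},
})

# A hand-built failure case: the same content pasted into the JSON unescaped.
broken_call = (
    '{"name": "write_file", "arguments": {"path": "users.py", "content": "'
    + new_content
    + '"}}'
)


def parses(raw: str) -> bool:
    """Return True if the raw tool call is valid JSON."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False
```

Here parses(valid_call) holds and parses(broken_call) does not. The code inside the call is identical in both cases; only the wrapping differs, which matches the finding that the model knows the code but does not get it through the tool.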

Clearly and without surprise, the model breaks on multi-step plans. Add a test, run it, fix the error; carry a refactor across two files; build a small feature from a one-sentence description: all three fail (0 of 3; own measurement v0.6). Multi-step means the model has to hold a plan across several tool calls, and that holding is what breaks.

The context-window limit in practice

The Foundation Model works with a context window of 4096 tokens. In theory that is a number, in practice a hard wall. Of the tasks under context pressure only summarizing five short files succeeds; finding an inconsistency across four config files and carrying the same refactor across six files both fail (1 of 3; own measurement v0.6). As soon as several files have to be in context at once, the task statement, the file contents and the prior tool responses all compete for the same scarce space. What falls out is usually the start: the original instruction. The agent loses, mid-work, what it was supposed to do.
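The competition for space can be made tangible with a rough budget calculation. The sketch below rests on two assumptions that are not measured: a crude estimate of four characters per token, and a naive trim that drops the oldest message first. How apfel or the Foundation Model actually handles overflow is not shown here.

```python
CONTEXT_LIMIT = 4096  # tokens, the window size named in the article


def estimate_tokens(text: str) -> int:
    """Crude assumption: about four characters per token."""
    return max(1, len(text) // 4)


def trim_oldest_first(messages: list[str], limit: int = CONTEXT_LIMIT) -> list[str]:
    """Drop messages from the front until the remainder fits into the window."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > limit:
        kept.pop(0)  # the oldest message goes first
    return kept


# Illustrative input: one instruction followed by six file contents of 3000 characters.
instruction = "Rename load_config to read_config in all six files."
files = [f"# file {i}\n" + "x" * 2991 for i in range(6)]
kept = trim_oldest_first([instruction] + files)
```

With these illustrative sizes the six files alone exceed the window, so the trim removes the instruction first and then the first file. What remains is five files and no task statement.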

What the model says about itself

The separate measurement of the self-report delivers the most surprising result. Of 31 runs that count as failures by machine check, the model reported exactly one falsely as a success (own measurement v0.6). The small model is remarkably honest about its own failure. It rarely claims to be done when it is not.

That is a reassuring finding, but it is not the actual point. The point is that this rate is a property of this model in this eval, not a law of nature. The dangerous cell, “claims success on actual failure”, is the most expensive one in any agent system, and its frequency tends to rise with the capability of the model: larger models write more convincing “done, tests green” sentences, even when nothing is green. The self-report is a claim, not a proof, regardless of how rarely it was wrong here.
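The bookkeeping behind this measurement is a two-by-two table: what the model claimed against what the check found. A minimal sketch follows; the Run record and its field names are hypothetical and do not mirror the format of the raw runs in the repo.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    claimed_success: bool  # what the model said about the run
    check_passed: bool     # what the deterministic check() found


def tally(runs: list[Run]) -> Counter:
    """Count the four combinations of self-report and machine check."""
    return Counter((r.claimed_success, r.check_passed) for r in runs)


def false_success_rate(runs: list[Run]) -> float:
    """Share of machine-checked failures that the model reported as success."""
    failures = [r for r in runs if not r.check_passed]
    if not failures:
        return 0.0
    return sum(r.claimed_success for r in failures) / len(failures)
```

The dangerous cell is the key (True, False) in the tally. Applied to the numbers above, one false success among 31 failed runs gives a rate of 1/31, roughly 3 percent.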

The cloud comparison as an operating profile

To place the limit of the local model, we ran one representative task per category against a frontier model: Claude Sonnet 4.6 as a cloud agent with real file tools, checked with the same check() function as the local runs.

Category	Representative	Local	Cloud
Single-file edit	rename a function	fails	passes
Explanation	explain a function	passes	passes
Tool chain	read, edit, write back	fails	passes
Multi-step	refactor across two files	fails	passes
Large context	inconsistency across four files	fails	passes

Cloud sample: Sonnet 4.6 via Claude Code, 2026-06-08, one run per representative.

This is explicitly not a win comparison. It is an operating profile. The cloud agent carries everywhere the local model breaks, and the difference is model size, not the approach. The tool-calling protocol works, the tools work, the tasks are solvable. An honest caveat belongs here: the representatives are the easier ones of their category, and we did not measure the hardest tasks of the canon on the cloud side. The sample shows that the local model’s limit is a matter of model size, not that a cloud agent solves every task.

A consequence for the agent loop

From the self-report measurement follows a building rule, and it holds beyond our small agent. When the model reports “done”, that is another tool response, not a reliable state. Done is a tool result like any other and deserves the same suspicion. In the agent loop, the question of whether a task is finished belongs to a machine check: a re-read of the file, a test run, a diff against the expectation. The same check() discipline that carries this eval belongs in the loop itself. The second consequence follows from the findings: keep tasks small and local. The local model carries one step, one file, one clear edit; it breaks on the multi-step plan and the full context window. An agent built around that shape gets more out of the model than one that trusts it with the grand gesture.
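In code, the building rule comes down to one decision: the loop ends on a passing check, never on the model's word. The sketch below is an assumption about how such a loop could look, not the agent from this series. Here step stands for one model turn including its tool calls, and check stands for the deterministic test; both are hypothetical callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    verified: bool      # did the machine check pass?
    claimed_done: bool  # did the model say "done" at any point?


def run_until_verified(
    step: Callable[[], str],
    check: Callable[[], bool],
    max_steps: int = 5,
) -> Outcome:
    """Run agent steps until the check passes or the step budget is used up."""
    claimed = False
    for _ in range(max_steps):
        reply = step()
        # The claim is recorded for the report, but it never ends the loop.
        claimed = claimed or "done" in reply.lower()
        if check():
            return Outcome(verified=True, claimed_done=claimed)
    return Outcome(verified=False, claimed_done=claimed)
```

An outcome with claimed_done set and verified unset is exactly the dangerous cell from the self-report measurement, now caught inside the loop instead of in the eval afterwards.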

Demo repo: run the eval yourself

The state of this article is frozen as tag v0.6. The artifact is not new agent code but the eval harness.

Reproduce the eval

Check out the tag:

git clone https://codeberg.org/rotecodefraktion/apfel-coding-agent.git
cd apfel-coding-agent
git checkout v0.6

New in v0.6 over v0.5:

eval/tasks/01..15 — 15 tasks, each with prompt, fixtures and a deterministic check()
eval/run.sh / eval/run-all.sh — single and full run against a local apfel serve
eval/report.py — renders eval/results.md from the raw runs
eval/results.md — majority vote per task and self-report discrepancy
eval/cloud-reference.md — the cloud sample and the comparison

Start a serve and run the eval (apfel defaults to the same port as Ollama, so use a separate one):

apfel --serve --port 11509 &
APFEL_PORT=11509 ./eval/run-all.sh
python3 eval/report.py > eval/results.md

Every task is judged deterministically by its check() function, never by the model. A single run for a spot check:

APFEL_PORT=11509 ./eval/run.sh eval/tasks/04-explain.sh

What stays open

The biggest reservation is the bias of the canon. Fifteen tasks across five categories are a selection, not a full survey, and the selection co-determines where the dividing line runs. Other tasks would show other strengths and weaknesses. The eval is reproducible so that this selection stays auditable and can be extended.

The sharpest single finding remains the unexpected one: even the small single-file edit fails in the naive setup. The model knows the code, but it does not operate the tool. That is exactly where the next article starts. We build the agent not around a stronger model but around the weaknesses of the one we have, and measure how much of the edit can be recovered.

Previous article: The first real tools: file system and shell. Next article: Editing that works: constrained output instead of tool guessing. Repo tag: v0.6.