The agent loop with a done-check

Article 8 · Series: A Local Coding Agent with apfel

So far the agent runs single steps: one tool round-trip, one constrained edit. A real coding agent does more: it works toward a goal across several steps until that goal is reached. That needs a loop. The obvious way to build that loop, however, carries a flaw we already measured in Article 6. This article builds the loop so that it avoids the flaw, and turns the central consequence of the eval into code: done is a check, not a claim. The state is frozen as tag v0.8.

Why the naive loop only defers the problem

The classic agent loop is plan, act, observe. The model proposes a tool call, we run it and feed the result back, and that repeats until the model makes no more tool calls. That stop condition is the problem. “No more tool calls” means only one thing: the model considers itself done.

Article 6 measured what that self-report is worth. It is a claim, not a proof, and its reliability tends to fall, not rise, with model size. A loop that ends on the model’s silence measures that same self-report again and calls it completion. We keep this naive variant in the repo as a contrast and build the actual loop differently.
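For contrast, the naive stop condition can be sketched in a few lines. This is a simplified illustration, not the variant kept in the repo: ModelStep and naiveLoop are hypothetical names, and the model is replaced by a scripted list of steps.

```swift
// Hypothetical stand-in for one model response; not the repo's type.
struct ModelStep {
    let toolCalls: [String]   // names of requested tools, empty when the model is "silent"
    let content: String?
}

/// Naive loop: stops as soon as the model makes no more tool calls.
func naiveLoop(steps: [ModelStep], maxIterations: Int) -> (iterations: Int, claimedDone: Bool) {
    for (index, step) in steps.prefix(maxIterations).enumerated() {
        if step.toolCalls.isEmpty {
            // The only "evidence" of completion is the model's silence.
            return (index + 1, true)
        }
        // A real loop would run the tools here and feed the results back.
    }
    return (min(steps.count, maxIterations), false)
}

let scripted = [
    ModelStep(toolCalls: ["write_file"], content: nil),
    ModelStep(toolCalls: [], content: "I think I'm finished"),
]
let outcome = naiveLoop(steps: scripted, maxIterations: 8)
print("stopped after \(outcome.iterations) step(s), claimedDone: \(outcome.claimedDone)")
// prints: stopped after 2 step(s), claimedDone: true
```

Nothing in this sketch ever looks at the working directory. "Done" is derived purely from what the model did not say.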

Separating drive from stop

The fix is a clean split of two roles the naive loop conflates. The model is the drive: it proposes the next step, a tool call or a text reply. The stop, however, belongs to the program, not to the model. After each step the program checks against an explicit goal whether the task is met.

We wrap that check behind a narrow protocol. It returns a deterministic verdict and an output that works as feedback:

public struct VerifyResult: Sendable, Equatable {
    public let passed: Bool
    public let output: String
}

public protocol Verifier: Sendable {
    func verify() async -> VerifyResult
}
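Because the protocol has a single method, a conforming type can be very small. The following is a minimal sketch of a closure-backed verifier as one might use it in tests. The name ClosureVerifier appears in the article's test code, but this implementation is an assumption and not taken from the repo.

```swift
// Repeated from the article so the sketch is self-contained.
struct VerifyResult: Sendable, Equatable {
    let passed: Bool
    let output: String
}

protocol Verifier: Sendable {
    func verify() async -> VerifyResult
}

/// Assumed implementation: wraps any async check in the Verifier protocol.
struct ClosureVerifier: Verifier {
    private let check: @Sendable () async -> VerifyResult

    init(_ check: @escaping @Sendable () async -> VerifyResult) {
        self.check = check
    }

    func verify() async -> VerifyResult {
        await check()
    }
}

// Usage: a verifier that never passes and always reports the same feedback.
let alwaysRed = ClosureVerifier { VerifyResult(passed: false, output: "still red") }
let verdict = await alwaysRed.verify()
print(verdict.passed, verdict.output)
```

The narrow surface is the point: the loop only needs a verdict and some text to feed back, so a shell command, a closure, or any other check can stand behind it.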

The goal as a verify command

What does an “explicit, machine-checkable goal” look like in practice? Most honestly as the thing a developer already trusts: a command that either passes or does not. ShellVerifier runs a shell command. Exit code 0 means met, anything else means not yet, and the command's output serves as feedback:

public struct ShellVerifier: Verifier {
    let command: String
    let workdir: URL

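    public func verify() async -> VerifyResult {
        // ... Process with /bin/zsh -c command in the workdir ...
        // Drain the pipes before waiting, so a full pipe buffer cannot block the child.
        // Caveat: stdout is read to EOF first; a child that fills stderr in the meantime can still stall.
        let outData = stdout.fileHandleForReading.readDataToEndOfFile()
        let errData = stderr.fileHandleForReading.readDataToEndOfFile()
        process.waitUntilExit()
        // ...
        // Exit code 0 is the only thing that counts as "met".
        return VerifyResult(passed: process.terminationStatus == 0, output: combined)
    }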
}
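The elided parts can be filled in several ways. The following self-contained sketch shows one of them, under stated assumptions: stdout and stderr share a single pipe, so only one stream has to be drained, and the shell is /bin/sh instead of /bin/zsh for portability. The type name MergedPipeShellVerifier is hypothetical, and this is not the repo's implementation.

```swift
import Foundation

struct VerifyResult: Sendable, Equatable {
    let passed: Bool
    let output: String
}

protocol Verifier: Sendable {
    func verify() async -> VerifyResult
}

/// Sketch of a shell-backed verifier. Assumption: both output streams share one pipe.
struct MergedPipeShellVerifier: Verifier {
    let command: String
    let workdir: URL

    func verify() async -> VerifyResult {
        let process = Process()
        process.executableURL = URL(fileURLWithPath: "/bin/sh")
        process.arguments = ["-c", command]
        process.currentDirectoryURL = workdir

        let pipe = Pipe()
        process.standardOutput = pipe
        process.standardError = pipe

        do {
            try process.run()
        } catch {
            // A command that cannot start is treated as "not yet", with the error as feedback.
            return VerifyResult(passed: false, output: "could not start verify command: \(error)")
        }

        // Read to EOF before waiting, so a full pipe buffer cannot block the child.
        // Note: these calls block the current thread, which is acceptable for a sketch.
        let data = pipe.fileHandleForReading.readDataToEndOfFile()
        process.waitUntilExit()

        let output = String(decoding: data, as: UTF8.self)
        return VerifyResult(passed: process.terminationStatus == 0, output: output)
    }
}

// Usage: one passing and one failing check in a temporary directory.
let tmp = FileManager.default.temporaryDirectory
let ok = await MergedPipeShellVerifier(command: "true", workdir: tmp).verify()
let red = await MergedPipeShellVerifier(command: "echo still red >&2; exit 1", workdir: tmp).verify()
print(ok.passed, red.passed, red.output)
```

Merging the streams loses the distinction between stdout and stderr. For feedback to the model that is usually acceptable, since compilers and test runners tend to spread their diagnostics across both.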

On the command line the goal thus becomes the verify command. The task is in the prompt, the success criterion in --until:

apfel-agent --until "swift test" "Make the failing test in CalculatorTests pass."

The loop runs until swift test exits 0. The model decides the steps, but whether the goal is reached is decided by the compiler and the test suite, not by the model.

A failed verify is feedback, not an abort

A failed verify does not end the loop. Its output goes back into the conversation as the next message, so the model can react to the actual failure instead of continuing to guess in the dark. That is the heart of the loop:

public func run(_ task: String) async throws -> Result {
    var conversation = [ChatMessage(role: "user", content: task)]

    for iteration in 1...maxIterations {
        let response = try await complete(conversation, nil)
        guard let choice = response.choices.first else {
            return Result(outcome: .exhausted(iterations: iteration), finalContent: nil)
        }

        // Drive: run the step the model proposed.
        if let calls = choice.message.toolCalls, !calls.isEmpty {
            conversation.append(ChatMessage(assistantToolCalls: calls))
            for call in calls {
                conversation.append(ChatMessage(toolCallID: call.id, content: await result(for: call)))
            }
        } else if let content = choice.message.content {
            conversation.append(ChatMessage(role: "assistant", content: content))
        }

        // Stop: the machine check, not the model's silence.
        let verdict = await verifier.verify()
        if verdict.passed {
            return Result(outcome: .done(iterations: iteration), finalContent: choice.message.content)
        }

        // A failed verify is feedback.
        conversation.append(ChatMessage(
            role: "user",
            content: "Not done yet. The check still fails:\n\(verdict.output)\nKeep going."
        ))
        conversation = context.trim(conversation)
    }

    return Result(outcome: .exhausted(iterations: maxIterations), finalContent: nil)
}

The difference from the naive loop is in a single line: we check verifier.verify(), not whether the model still makes tool calls. A test pins it down. The model replies with plain text, no tool call, which the naive loop would immediately count as done. The verifier passes only on the second check, so the loop must run a second round:

@Test("the model's silence is not 'done' — only the verifier decides")
func silenceIsNotDone() async throws {
    let counter = PassCounter(passAt: 2)
    let loop = makeLoop(Scripted([textResponse("I think I'm finished")]),
                        verifier: ClosureVerifier { VerifyResult(passed: await counter.tick(), output: "still red") })
    let result = try await loop.run("do it")
    #expect(result.outcome == .done(iterations: 2))
}
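The test relies on a helper that fails the first check and passes the second. A minimal sketch of such a counter, assuming an actor so that it can be called safely from the verifier's async closure. The name PassCounter comes from the test above, but the implementation here is an assumption:

```swift
/// Assumed helper: reports `true` from the `passAt`-th call onward.
actor PassCounter {
    private let passAt: Int
    private var ticks = 0

    init(passAt: Int) {
        self.passAt = passAt
    }

    /// Counts one verify call and says whether the check should pass now.
    func tick() -> Bool {
        ticks += 1
        return ticks >= passAt
    }
}

let counter = PassCounter(passAt: 2)
let first = await counter.tick()
let second = await counter.tick()
print(first, second)
// prints: false true
```

With passAt: 2 the first verify stays red and the second goes green, which is what forces the loop into a second round.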

Two guards

A loop that does not listen to the model needs its own bounds. Otherwise it either runs forever or overflows the context window. We add two.

The first is an iteration limit. If the verify never passes, the loop ends after a fixed number of rounds with exhausted instead of in an infinite loop. That is not an edge case but the honest outcome when the model cannot solve the task.
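Reduced to its skeleton, the iteration limit looks like the following sketch. The function runLoop and its closure parameters are hypothetical simplifications: step stands in for the model's turn, check for the verifier.

```swift
enum Outcome: Equatable {
    case done(iterations: Int)
    case exhausted(iterations: Int)
}

/// Minimal goal loop with an upper bound on rounds.
func runLoop(maxIterations: Int,
             step: () async -> Void,
             check: () async -> Bool) async -> Outcome {
    precondition(maxIterations >= 1, "the loop needs at least one round")
    for iteration in 1...maxIterations {
        await step()                       // drive: whatever the model proposes
        if await check() {                 // stop: only the check decides
            return .done(iterations: iteration)
        }
    }
    // The bound was hit without a passing check.
    return .exhausted(iterations: maxIterations)
}

// A check that never passes ends the loop after exactly eight rounds.
let stuck = await runLoop(maxIterations: 8, step: {}, check: { false })
print(stuck == .exhausted(iterations: 8))
// prints: true
```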

The second is a context budget. The Foundation Model works with 4096 tokens (Article 6). A multi-step loop accumulates tool results each round until the window overflows and the model loses sight of the original task. The ContextManager trims the history before that happens. It keeps the task and the most recent rounds, drops the oldest middle messages, and leaves no orphaned tool result behind:

public func trim(_ messages: [ChatMessage]) -> [ChatMessage] {
    guard messages.count > 2, Self.estimate(messages) > maxTokens else { return messages }

    var kept = messages
    while kept.count > 2, Self.estimate(kept) > maxTokens {
        kept.remove(at: 1)
    }
    // A tool result without its call would be an orphaned remnant.
    while kept.count > 1, kept[1].role == "tool" {
        kept.remove(at: 1)
    }
    return kept
}

The token estimate is deliberately rough, about four characters per token plus a small per-message overhead. The goal is a budget that prevents overflow, not an exact count.
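To make the budget concrete, here is a self-contained sketch that pairs the trim logic from above with one possible estimate. Message, Budget and the overhead of four tokens per message are assumptions for illustration; the repo's ContextManager may differ.

```swift
// Hypothetical stand-in for ChatMessage.
struct Message: Equatable {
    let role: String
    let content: String
}

struct Budget {
    let maxTokens: Int

    static let charactersPerToken = 4
    static let perMessageOverhead = 4   // assumed value

    /// Rough estimate: characters divided by four (rounded up) plus a fixed overhead.
    static func estimate(_ messages: [Message]) -> Int {
        messages.reduce(0) { total, message in
            let contentTokens = (message.content.count + charactersPerToken - 1) / charactersPerToken
            return total + contentTokens + perMessageOverhead
        }
    }

    /// Same shape as the article's trim: keep the task, drop the oldest middle messages.
    func trim(_ messages: [Message]) -> [Message] {
        guard messages.count > 2, Self.estimate(messages) > maxTokens else { return messages }
        var kept = messages
        while kept.count > 2, Self.estimate(kept) > maxTokens {
            kept.remove(at: 1)
        }
        // Drop tool results whose originating call was trimmed away.
        while kept.count > 1, kept[1].role == "tool" {
            kept.remove(at: 1)
        }
        return kept
    }
}

let history = [
    Message(role: "user", content: "Fix the test."),                      // 4 + 4 = 8 tokens
    Message(role: "assistant", content: "calling read_file"),             // 5 + 4 = 9 tokens
    Message(role: "tool", content: String(repeating: "x", count: 200)),   // 50 + 4 = 54 tokens
    Message(role: "user", content: "Not done yet."),                      // 4 + 4 = 8 tokens
]
print(Budget.estimate(history))
// prints: 79

let trimmed = Budget(maxTokens: 72).trim(history)
print(trimmed.map(\.role))
// prints: ["user", "user"]
```

In this example, dropping the assistant message alone already brings the estimate down to 70, under the budget of 72. The tool result is removed anyway, because without its call it would be an orphan.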

Making the loop observable

A loop that takes several steps on its own has to be visible. The agent reports what it does on stderr, while the final text goes to stdout. The outcome comes last, as a single line that captures the loop’s nature:

→ done after 1 iteration(s): verify passed

or, when the model fails:

→ exhausted after 8 iteration(s): verify still failing

Both outcomes are facts. The first is a passed verify, the second an exhausted bound. Neither is a claim by the model.
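One way to produce these lines is to derive them directly from the outcome value, so the report cannot drift from what the loop actually decided. A minimal sketch, in which the summary property and the report function are hypothetical names:

```swift
import Foundation

enum Outcome: Equatable {
    case done(iterations: Int)
    case exhausted(iterations: Int)

    /// The one-line report, built from the outcome and nothing else.
    var summary: String {
        switch self {
        case .done(let n):
            return "→ done after \(n) iteration(s): verify passed"
        case .exhausted(let n):
            return "→ exhausted after \(n) iteration(s): verify still failing"
        }
    }
}

/// Progress and outcome go to stderr, so stdout stays clean for the final text.
func report(_ line: String) {
    FileHandle.standardError.write(Data((line + "\n").utf8))
}

report(Outcome.done(iterations: 1).summary)
print("final text from the model")   // stdout
```

Keeping the two streams apart means the final text can be piped into another command without the progress lines getting in the way.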

A run by example

Here is a real run against apfel, in a working directory that contains a single file. The goal is to write the word DONE into that file, and the verify command checks exactly that:

apfel-agent --workdir /tmp/work \
  --until "grep -q DONE report.txt" \
  "Write the word DONE into the file report.txt. Use the write_file tool."

The model calls write_file, the diff is shown at the confirmation gate from Article 5, and after the write grep -q DONE report.txt passes:

Write to report.txt?
- placeholder
+ DONE

→ done after 1 iteration(s): verify passed

The loop ended because a machine confirmed the goal as reached, not because the model claimed to be done (own measurement v0.8, apfel 1.5.1).

What the loop does not rescue

The loop does not move the limit Article 7 drew. It gives the model more attempts and a deterministic stop criterion, but it does not improve the model’s coding judgment. When the task demands an edit the small model cannot produce, such as the ambiguous switch to async, no amount of patience in the loop satisfies the verify, and the run ends honestly with exhausted. That is not a flaw of the loop but its honesty: it claims no success that is not there.

Demo repo: apfel-coding-agent v0.8

The state of this article is frozen as tag v0.8: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v0.8

Try the goal loop

Check out the tag:

git clone https://codeberg.org/rotecodefraktion/apfel-coding-agent.git
cd apfel-coding-agent
git checkout v0.8

New in v0.8 over v0.7:

Sources/AgentCore/Agent/GoalLoop.swift — the goal-driven loop
Sources/AgentCore/Agent/Verifier.swift — the machine done-check
Sources/AgentCore/Agent/ContextManager.swift — the context budget
--until in the CLI
docs/adr/004-done-ist-eine-pruefung.md

Build, test, and run against a running apfel (on its own port, since Ollama occupies the default):

swift build
swift test                        # offline, no apfel needed
apfel --serve --port 11509 &
mkdir -p /tmp/work && printf 'placeholder\n' > /tmp/work/report.txt
swift run apfel-agent --workdir /tmp/work \
  --base-url http://127.0.0.1:11509 \
  --until "grep -q DONE report.txt" \
  "Write the word DONE into the file report.txt. Use the write_file tool."

The unit tests check the loop offline against a scripted fake backend: that the model’s silence does not count as done, that the iteration limit holds, that the history is trimmed correctly.

What remains

The agent now works toward a goal across several steps and knows deterministically when it is done. The scaffold stands: the drive comes from the model, the stop from a check, and the bounds keep it in frame. What is missing is the surface that makes it tangible. So far the agent is a one-shot command; the next step turns it into an interactive session where streaming, diffs and confirmations come together in the terminal.

Previous article: Editing that works: constrained output instead of tool guessing. Next article: The interactive terminal session. Repo tag: v0.8.