The agent in the browser

Article 11 · Series: A Local Coding Agent with apfel

The agent has had two shapes so far. First a command that sends a prompt and prints the reply, then an interactive session in the terminal that holds a conversation and uses its tools with confirmation. Both live in the terminal. Now we give it a third shape: behind a server, with a surface in the browser. Nothing about the agent itself changes. What changes is the sink its work flows into. The state is frozen as tag v1.0.

The same logic, a different sink

The interactive session from Article 9 was built from the start not to know its own input and output. It is handed three closures: one that reads a line, one that writes text, one that answers a turn. The terminal wires them to the keyboard and the screen. That split pays off now. The server wires the same logic to HTTP and a browser, without changing a line of the agent core.
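
To make the three-closure idea concrete, here is a minimal sketch. The names `Session`, `readInput`, `write` and `answer` are assumptions for illustration; the repo's actual types may look different.

```swift
// Hypothetical names throughout; a sketch of the idea, not the repo's code.
struct Session {
    let readInput: () async -> String?      // returns nil when input ends
    let write: (String) async -> Void       // wherever the output should go
    let answer: (String) async -> String    // runs one agent turn

    // The loop itself knows nothing about terminals or HTTP.
    func run() async {
        while let line = await readInput() {
            let reply = await answer(line)
            await write(reply)
        }
    }
}
```

A terminal would pass closures that wrap the keyboard and the screen; a server would pass closures that wrap a request body and a response stream. The loop stays the same.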

For both surfaces to truly share the core, they need a common language for what happens during a turn. We give those moments a name, as an enum:

public enum AgentEvent: Sendable, Equatable {
    case token(String)
    case toolCall(name: String, arguments: String)
    case toolResult(name: String, ok: Bool, summary: String)
    case confirmation(tool: String, diff: String)
    case done(reason: String)
}

A turn falls out as a stream of such events. The terminal renders them to text, the server sends them to the browser as Server-Sent Events. Same core, two sinks. The type lives in the agent core and knows nothing of HTTP; the translation to the wire format is an extension in the server module. The core stays free of the transport layer.
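
What such an extension could look like, as a sketch: the enum is repeated so the block compiles on its own, the field names follow the frames shown in this article, and the type string "tool_result" is an assumption because no such frame appears here. The keys are sorted, which matches the frames shown in this article; the repo may build its JSON differently.

```swift
import Foundation

// Repeated from the article so the sketch is self-contained.
enum AgentEvent: Sendable, Equatable {
    case token(String)
    case toolCall(name: String, arguments: String)
    case toolResult(name: String, ok: Bool, summary: String)
    case confirmation(tool: String, diff: String)
    case done(reason: String)
}

extension AgentEvent {
    // JSON fields per case. "tool_result" is an assumed type name.
    var payload: [String: Any] {
        switch self {
        case .token(let text):
            return ["type": "token", "text": text]
        case .toolCall(let name, let arguments):
            return ["type": "tool_call", "name": name, "arguments": arguments]
        case .toolResult(let name, let ok, let summary):
            return ["type": "tool_result", "name": name, "ok": ok, "summary": summary]
        case .confirmation(let tool, let diff):
            return ["type": "confirmation", "tool": tool, "diff": diff]
        case .done(let reason):
            return ["type": "done", "reason": reason]
        }
    }

    // One SSE frame: a single data: line followed by a blank line.
    // JSON escapes newlines inside strings, so the payload stays on one line.
    func sseEncoded() -> String {
        let data = (try? JSONSerialization.data(withJSONObject: payload,
                                                options: [.sortedKeys])) ?? Data("{}".utf8)
        return "data: \(String(decoding: data, as: UTF8.self))\n\n"
    }
}
```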

Two layers, two jobs

Below the agent sits apfel-serve, still exposing the model as an OpenAI-compatible endpoint. Above it now comes a Hummingbird server. At first glance that looks like one layer too many, a proxy in front of a proxy. It is not. The two servers have different jobs. apfel-serve delivers model tokens. Hummingbird is where those tokens become an agentic turn: pick tools, run them, feed results back, ask before any action that writes. One speaks the model protocol, the other orchestrates the agent.

The server is its own target, AgentServer, built on Hummingbird 2. The routes are few:

let router = Router()
router.get("health") { _, _ in "ok" }
router.get("/") { _, _ in /* the single-page UI */ }
router.post("chat") { request, context in /* stream a turn as SSE */ }
router.post("confirm") { request, context in /* resolve a parked confirmation */ }

Four routes, and the two interesting ones are chat and confirm. The first opens an event stream, the second is the back-channel for confirmation.

The turn as an event stream

A chat request returns no finished answer but a stream. Server-Sent Events are the simplest format for that which the browser understands natively: one open HTTP response into which the server writes frame after frame, each frame a line beginning with data: and terminated by a blank line. In Hummingbird 2 the response is a ResponseBody with a writer that we feed from an event stream:

let body = ResponseBody { writer in
    for await event in stream {
        try await writer.write(ByteBuffer(string: event.sseEncoded()))
    }
    try await writer.finish(nil)
}

The turn runs concurrently and pushes its events into the stream, the writer pulls them out and writes them as frames. In the browser a small piece of JavaScript consumes the frames and builds the surface from them: tokens are appended to the answer, tool calls shown as small chips, the done event frees the input again. No framework, just fetch and a parser for the data: lines.
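
The push and pull sides can be sketched with an AsyncStream. This is a minimal illustration, not the repo's code: `Event`, `runTurn` and `drain` are hypothetical names, the turn is faked with two fixed events, and `AsyncStream.makeStream` requires Swift 5.9 or later.

```swift
// Hypothetical stand-in for the article's AgentEvent.
enum Event: Equatable, Sendable {
    case token(String)
    case done(reason: String)
}

// Producer: the turn runs in its own task and pushes events into the stream.
func runTurn() -> AsyncStream<Event> {
    let (stream, continuation) = AsyncStream.makeStream(of: Event.self)
    Task {
        continuation.yield(.token("Hello"))
        continuation.yield(.done(reason: "complete"))
        continuation.finish()   // ends the for-await loop on the consumer side
    }
    return stream
}

// Consumer: pulls events in order, the way the response writer would.
func drain(_ stream: AsyncStream<Event>) async -> [Event] {
    var seen: [Event] = []
    for await event in stream { seen.append(event) }
    return seen
}
```

The stream buffers events the producer yields before the consumer asks for them, so neither side has to wait for the other to be ready.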

A turn is routed exactly as in the terminal session. A request that wants to change a file goes through the constrained edit flow from Article 7, not through free tool-calling, which fails at editing. Everything else, that is reading, listing, explaining, running commands, goes through a read-only tool round-trip. The writing tools are not even in that registry; they reach the user only through the edit flow with its gate.
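
The routing decision can be sketched in a few lines. Everything here is illustrative: `Intent`, `Route` and `route` are hypothetical names, and of the tool names only write_file and get_time appear in this article; read_file and list_files are assumed.

```swift
// Hypothetical result of the intent classifier.
enum Intent { case edit, other }

// Read-only registry: the writing tools are deliberately absent.
// read_file and list_files are assumed names for illustration.
let readOnlyTools: Set<String> = ["read_file", "list_files", "get_time"]

enum Route: Equatable {
    case editFlow                          // constrained edit flow with its gate
    case toolRoundTrip(tools: Set<String>) // free tool-calling, read-only
}

func route(_ intent: Intent) -> Route {
    switch intent {
    case .edit:  return .editFlow
    case .other: return .toolRoundTrip(tools: readOnlyTools)
    }
}
```

Because write_file never enters the read-only set, a misrouted request cannot write by accident; the worst case is that a harmless question lands in the edit flow and stops at the gate.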

Confirming without a keypress

In the terminal the confirmation gate blocks on a keypress. It shows the diff and waits for y, n or a. Over HTTP there is no keypress and no synchronous human at the other end. The rule should hold anyway: no file is written, no command runs, before someone has agreed.

The gate has been a protocol with a single asynchronous method since Article 5. That is exactly what makes the web variant simple. The WebGate writes a confirmation event into the live stream and parks the turn on a continuation:

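public func confirm(_ action: PendingAction) async -> Decision {
    // Show the diff to the browser first.
    await emit(.confirmation(tool: action.toolName, diff: action.preview))
    // The decision may already have arrived while we were emitting.
    if let buffered { self.buffered = nil; return buffered }
    // Otherwise park the turn until resolve(_:) is called.
    return await withCheckedContinuation { self.pending = $0 }
}

public func resolve(_ decision: Decision) {
    // Wake a parked turn, or keep the decision for a turn that has not parked yet.
    if let pending { self.pending = nil; pending.resume(returning: decision) }
    else { buffered = decision }
}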

The turn stands still, mid-work. In the browser the diff appears with three buttons. A click sends POST /confirm with the decision, resolve wakes the continuation, the turn continues. The small buffer covers the case where the answer arrives before the turn has even parked.
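
Both orderings can be tried in isolation with a stripped-down gate. This is a sketch under assumptions: the gate is an actor, `emit` takes a plain string instead of an event, and the `Decision` cases `allow` and `always` are assumed names (only deny appears in this article).

```swift
// Assumed cases; the article only shows "deny".
enum Decision: Equatable, Sendable { case allow, deny, always }

actor WebGate {
    private var pending: CheckedContinuation<Decision, Never>?
    private var buffered: Decision?
    private let emit: @Sendable (String) async -> Void

    init(emit: @escaping @Sendable (String) async -> Void) {
        self.emit = emit
    }

    // Called by the turn: announce the action, then wait for a decision.
    func confirm(preview: String) async -> Decision {
        await emit(preview)
        // A decision that arrived during emit is waiting in the buffer.
        if let early = buffered { buffered = nil; return early }
        // Otherwise suspend until resolve(_:) resumes the continuation.
        return await withCheckedContinuation { self.pending = $0 }
    }

    // Called by the POST /confirm handler.
    func resolve(_ decision: Decision) {
        if let waiting = pending {
            pending = nil
            waiting.resume(returning: decision)
        } else {
            buffered = decision
        }
    }
}
```

Whether resolve runs before or after confirm has parked, the caller of confirm receives the same decision. The sketch handles one open confirmation at a time, which matches one open gate per session.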

In the run against apfel it looks like this. We ask the agent to insert a line into a file. The stream halts:

data: {"diff":"+ GREETING to note.txt","tool":"write_file","type":"confirmation"}

Here the stream ends for now. No done, no write. Only POST /confirm with deny brings it to a close:

data: {"text":"Insertion in note.txt declined.","type":"token"}

data: {"reason":"complete","type":"done"}

The file stays untouched. The same safety as in the terminal, translated into two HTTP requests.

Where the model disturbs the stream

The server makes one property of the small model more visible than any surface before: its unreliability at tool-calling. We sent the same request, “tell the time, use your tool”, three times (own measurement, v1.0). Once a clean tool call came through:

data: {"arguments":"{}","name":"get_time","type":"tool_call"}

The other two times it did not. Once the model wrote the tool call out as text, cast in JSON but as a reply rather than a call, so no tool ran. Once the intent classifier missed the boundary and pushed a harmless question into the edit flow. This is the same weakness from Articles 4 to 7, and the event stream does not hide it, it shows it frame by frame. A surface that streams honestly streams the lapses along with it.

Sessions and limits

The server keeps one session per browser, created on first contact and handed back via a header. A session carries its conversation history across turns and its currently open gate. Within a session turns run one after another, different sessions run independently. The registry is an actor, so two concurrent requests, say a chat and a confirm, cannot get in each other’s way.
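
A registry of this kind can be sketched as a small actor. The names `SessionData` and `SessionRegistry` and the method signatures are assumptions for illustration; the real session also holds the open gate, which is left out here, and the header that carries the id is not modelled.

```swift
import Foundation

// Hypothetical, reduced session: only the conversation history.
struct SessionData: Sendable {
    var history: [String] = []
}

actor SessionRegistry {
    private var sessions: [String: SessionData] = [:]

    // Returns the id to hand back to the browser.
    // A missing or unknown id creates a fresh session.
    func session(for id: String?) -> String {
        if let id, sessions[id] != nil { return id }
        let fresh = UUID().uuidString
        sessions[fresh] = SessionData()
        return fresh
    }

    func append(_ message: String, to id: String) {
        sessions[id]?.history.append(message)
    }

    func history(of id: String) -> [String] {
        sessions[id]?.history ?? []
    }
}
```

The actor serialises access to the dictionary, so a chat request and a confirm request touching the same session cannot corrupt it. Running the turns of one session strictly one after another needs additional bookkeeping that this sketch does not show.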

We draw two limits deliberately. First, the final text arrives as one token event, not character by character. What is live is the level of the turn: the tool calls, the confirmation, the close. Streaming the model text character by character would be a further step and would require the loop itself to stream. Second, there is no login, no persistence, and one sandbox per process. A production multi-user server with auth and rate-limiting is its own topic, which the Hummingbird series covers.

Demo repo: apfel-coding-agent v1.0

The state of this article is frozen as tag v1.0: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v1.0

Try the server

First apfel-serve, then the agent server, from the repo root:

git clone https://codeberg.org/rotecodefraktion/apfel-coding-agent.git
cd apfel-coding-agent
git checkout v1.0
swift build --product apfel-agent-server

apfel --serve --port 11577        # the model (11434 is often taken by Ollama)
.build/debug/apfel-agent-server \
  --base-url http://127.0.0.1:11577 --port 8099 --workdir .

Then open http://127.0.0.1:8099 in the browser. Without a browser, by hand:

curl -sN -X POST http://127.0.0.1:8099/chat \
  -H 'content-type: application/json' \
  -d '{"message":"Read note.txt and summarise it."}'

New in v1.0 over v0.10:

Sources/AgentServer/ — the Hummingbird server (routes, SSE, WebGate, session registry)
Sources/apfel-agent-server/ — the executable
Sources/AgentCore/Agent/AgentEvent.swift — the shared turn language
web/index.html — the single-page UI
docs/usage-server.md, docs/adr/006-hummingbird-frontend.md

Conclusion

The agent now has three surfaces, and all three sit on the same core. That is the real lesson of this article: separate orchestration from output cleanly and the next surface comes almost for free. The terminal became a server by swapping the output sink and running a gate over two HTTP requests. The agent stays what it was from the start, a local client against a model on your own device. Before we put it into Xcode for the climax of the series, the next step pauses to ask what all this sovereignty actually rests on.

Previous article: Tools the agent doesn’t write: MCP. Next article: Sovereignty on borrowed ground. Repo tag: v1.0.