The serve mode and the OpenAI protocol

Article 2 · Series: A Local Coding Agent with apfel

Article 1 showed what apfel can do on the command line. This article looks at the other side: apfel --serve turns the local Foundation Model into an HTTP server with an OpenAI-compatible API. This is the layer our Swift client will dock onto in Article 3: no direct framework call and no platform lock-in, just a protocol spoken by every SDK that expects an OpenAI endpoint. In this article we start the server, walk through every endpoint, send real requests by hand and work out where the protocol diverges from OpenAI. The state of this article is frozen as tag v0.2 in the demo repo: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v0.2

Starting apfel --serve

apfel --serve

In the terminal the server prints a banner with the active settings (here started with --debug, hence the Hummingbird log line at the end):

apfel server v1.5.1
├ endpoint: http://127.0.0.1:11434
├ model:    apple-foundationmodel
├ cors:     disabled
├ origin:   localhost only (http://127.0.0.1, http://localhost, http://[::1])
├ token:    none
├ health:   public
├ max concurrent: 5
├ debug:    on
└ ready

Endpoints:
  POST http://127.0.0.1:11434/v1/chat/completions
  GET  http://127.0.0.1:11434/v1/models
  GET  http://127.0.0.1:11434/v1/logs
  GET  http://127.0.0.1:11434/v1/logs/stats
  GET  http://127.0.0.1:11434/health

2026-06-03T08:19:01+0200 info Hummingbird: [HummingbirdCore] Server started and listening on 127.0.0.1:11434

Once └ ready appears, the server accepts requests. The model is already loaded into memory at startup, which the /health response confirms further down with the field prewarmed: true, so the first request arrives without cold-start latency.

The HTTP server runs on Hummingbird, the Swift HTTP framework from hummingbird-project/hummingbird. The log line [HummingbirdCore] Server started and listening on 127.0.0.1:11434 makes this visible. Anyone who knows rotecodefraktion’s hummingbird-llm series will recognise the same server layer: here it is packaged ready-made inside apfel, without us having to write any server code ourselves. (We infer this from the log prefix, not from a source audit.)

Server options. The banner shows the defaults; all of them can be overridden:

Flag / env                   Default     Function
--port <n> / APFEL_PORT      11434       TCP port
--host <addr> / APFEL_HOST   127.0.0.1   Bind address
--cors                       off         CORS headers for browser clients
APFEL_TOKEN                  none        Bearer token; if set, all requests except /health need Authorization: Bearer <token>
--debug                      off         Request log and event traces (prerequisite for /v1/logs)
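
A client has to arrive at the same address the server was started on. The following is a minimal sketch, assuming the client chooses to read APFEL_HOST and APFEL_PORT itself (the article documents them only for the server); the helper name base_url is hypothetical.

```python
import os

DEFAULT_HOST = "127.0.0.1"  # apfel's default bind address (from the banner)
DEFAULT_PORT = 11434        # apfel's default port (from the banner)


def base_url(env=None):
    """Build the server base URL from APFEL_HOST / APFEL_PORT, falling back to the defaults."""
    env = os.environ if env is None else env
    host = env.get("APFEL_HOST", DEFAULT_HOST)
    port = int(env.get("APFEL_PORT", DEFAULT_PORT))
    # 0.0.0.0 is a bind address, not a destination; a local client still dials loopback.
    if host == "0.0.0.0":
        host = DEFAULT_HOST
    return f"http://{host}:{port}"


print(base_url({}))                      # http://127.0.0.1:11434
print(base_url({"APFEL_PORT": "3000"}))  # http://127.0.0.1:3000
```

Passing the environment as a plain dictionary keeps the helper testable without touching the real process environment.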

A modified start for local development with CORS enabled, a different port and the bind address opened to all interfaces:

apfel --serve --port 3000 --host 0.0.0.0 --cors

We come to the implications of --host 0.0.0.0 in the section on the security posture.

The endpoint map

apfel exposes five endpoints:

Endpoint               Method   OpenAI equivalent   Note
/health                GET      no                  Liveness + model status
/v1/models             GET      yes                 Model card with supported/unsupported parameters
/v1/chat/completions   POST     yes                 Chat completion, non-stream and SSE
/v1/logs               GET      no                  Recent requests with full bodies + event trace
/v1/logs/stats         GET      no                  Aggregate stats

/health is public, even when APFEL_TOKEN is set. The two /v1/logs endpoints are apfel-specific extensions with no OpenAI equivalent; they only return data when the server was started with --debug. Without --debug apfel answers with HTTP 400 and the message "Request log stats are only available when the server is started with --debug."

/health gives the system status at a glance:

curl -s http://127.0.0.1:11434/health | jq .
{
  "active_requests": 0,
  "context_window": 4096,
  "model": "apple-foundationmodel",
  "model_available": true,
  "prewarmed": true,
  "status": "ok",
  "supported_languages": ["fr","da","it","pt","es","sv","de","vi","ja","nl","nb","zh","en","tr","ko"],
  "version": "1.5.1"
}

model_available: true means the Foundation Model is loaded and responding. context_window: 4096 is the hard limit we come back to in the deviation table. The language list in the live response contains some codes twice; the set shown above is deduplicated. (Own measurement 2026-06-03 with apfel 1.5.1 on macOS 26.3.)
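
A client can turn this payload into a simple readiness gate. The sketch below assumes that status and model_available are the two fields worth checking; the function names is_ready and fetch_health are hypothetical, and only fetch_health needs a running server.

```python
import json
import urllib.request


def is_ready(health):
    """True when the /health payload reports a usable model."""
    return health.get("status") == "ok" and health.get("model_available") is True


def fetch_health(base="http://127.0.0.1:11434", timeout=2.0):
    """GET /health and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return json.load(resp)


# Offline check against the response shown above (language list omitted).
sample = json.loads("""{
  "active_requests": 0, "context_window": 4096,
  "model": "apple-foundationmodel", "model_available": true,
  "prewarmed": true, "status": "ok", "version": "1.5.1"
}""")
print(is_ready(sample))  # True
```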

/v1/models returns what apfel declares about itself:

curl -s http://127.0.0.1:11434/v1/models | jq '.data[0] | {id, owned_by, context_window, supported_parameters, unsupported_parameters}'

The one registered model is apple-foundationmodel, owned_by: apple, context_window: 4096. The fields supported_parameters and unsupported_parameters are apfel-specific additions to the OpenAI schema:

  • Supported: temperature, max_tokens, seed, stream, tools, tool_choice, response_format, x_context_strategy, x_context_max_turns, x_context_output_reserve
  • Not supported: logprobs, n, stop, presence_penalty, frequency_penalty

The x_ parameters are apfel’s context-management flags from chat mode, mapped onto the request body. More on that in the deviation section.

A chat completion request by hand

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Name three primary colors, comma separated."}],
    "temperature": 0
  }' | jq .

Response (real, own measurement 2026-06-03):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "apple-foundationmodel",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Red, blue, yellow.",
        "refusal": null
      }
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 4,
    "total_tokens": 14
  }
}

The shape follows the OpenAI standard. choices is an array; with apfel it is always a single element (n=1 is fixed). finish_reason: stop means the model completed the answer. The usage object counts tokens: prompt (10), completion (4), total (14).

Extract the content directly:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"apple-foundationmodel","messages":[{"role":"user","content":"Name three primary colors, comma separated."}],"temperature":0}' \
  | jq -r '.choices[0].message.content'
# Red, blue, yellow.
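
The same round-trip can be written with nothing but a standard library, which is roughly what any client has to do under the hood. This is a sketch in Python rather than the Swift of Article 3; the function names chat and first_content are hypothetical, and chat needs a running server while first_content can be checked offline.

```python
import json
import urllib.request


def chat(messages, base="http://127.0.0.1:11434", temperature=0):
    """POST one non-streaming chat completion and return the decoded JSON."""
    payload = {"model": "apple-foundationmodel", "messages": messages, "temperature": temperature}
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def first_content(response):
    """Return the assistant text of the single choice apfel produces."""
    return response["choices"][0]["message"]["content"]


# Offline check with the response shown above.
sample = {
    "choices": [{"index": 0, "finish_reason": "stop",
                 "message": {"role": "assistant", "content": "Red, blue, yellow.", "refusal": None}}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 4, "total_tokens": 14},
}
print(first_content(sample))  # Red, blue, yellow.
```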

The message roles follow the OpenAI convention: system for the system prompt, user for the input, assistant for earlier model answers in the conversation. A system prompt in the request:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple-foundationmodel",
    "messages": [
      {"role": "system", "content": "You answer in exactly one word."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0
  }' | jq -r '.choices[0].message.content'

Streaming with Server-Sent Events

We enable streaming with "stream": true. The protocol is Server-Sent Events (SSE): a single HTTP connection stays open and the server sends the answer as a stream of data: lines, each carrying a JSON object. The stream ends with the sentinel line data: [DONE].

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Name three primary colors, comma separated."}],
    "temperature": 0,
    "stream": true
  }'

Output (abridged, from a real measurement):

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"apple-foundationmodel","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"apple-foundationmodel","choices":[{"index":0,"delta":{"content":"Red, blue, yellow."},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"apple-foundationmodel","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

The order is always the same: first a chunk with delta.role, then chunk(s) with delta.content, then a chunk with finish_reason: stop, then [DONE].

Observation with short answers: the content arrives as a single chunk, not token by token. apfel streams coarse-grained for short outputs. With longer answers we see several content chunks. This is not a protocol violation; the standard prescribes no particular chunk granularity.
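
On the client side, the chunk sequence reduces to a small loop: skip everything that is not a data: line, stop at [DONE], and concatenate delta.content. The following sketch assumes the lines have already been read from the response; the function name collect_stream is hypothetical, and the sample chunks are shortened versions of the ones above.

```python
import json


def collect_stream(lines):
    """Accumulate delta.content from SSE lines until the [DONE] sentinel."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choice = json.loads(data)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


stream = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Red, blue, yellow."},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream(stream))  # ('Red, blue, yellow.', 'stop')
```

Because the loop only concatenates, it behaves the same whether the content arrives in one chunk or in many.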

When streaming makes sense: when the client should display the answer during generation, for instance in an interface with live output. In the agent we build in later articles, streaming is mainly relevant for tool-calling loops, where the client waits for a call to complete.

Where the protocol diverges from OpenAI

“OpenAI-compatible” does not mean “OpenAI-interchangeable”. These differences matter when switching from a real OpenAI endpoint to apfel:

Point                     OpenAI                                      apfel
stop                      up to 4 stop sequences                      HTTP 400
n                         multiple choices per request                HTTP 400 (always n=1)
logprobs                  token log probabilities                     HTTP 400
presence_penalty          -2.0 to 2.0                                 HTTP 400
frequency_penalty         -2.0 to 2.0                                 HTTP 400
Context window            depends on model (8k–128k+)                 4096 tokens
x_context_strategy etc.   not present                                 apfel-specific extension
usage in SSE chunks       optional via stream_options.include_usage   not in chunk; only in non-stream response

The 400 error on unsupported parameters looks like this:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Test."}],
    "stop": ["\n"]
  }' | jq .
{
  "error": {
    "message": "Parameter 'stop' is not supported by Apple's on-device model.",
    "type": "invalid_request_error"
  }
}
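
A client should surface this error object rather than a bare status code. The sketch below assumes the error body always has the shape shown above and falls back gracefully when it does not; the exception type ApfelRequestError and the helper error_from_body are hypothetical names.

```python
import json


class ApfelRequestError(Exception):
    """Hypothetical exception type carrying the server's error object."""

    def __init__(self, status, message, error_type):
        super().__init__(f"HTTP {status} ({error_type}): {message}")
        self.status = status
        self.error_type = error_type


def error_from_body(status, body):
    """Turn an OpenAI-style error body into an exception; tolerate non-JSON bodies."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        err = {}
    return ApfelRequestError(status, err.get("message", body), err.get("type", "unknown"))


# Offline check with the error shown above.
body = json.dumps({"error": {
    "message": "Parameter 'stop' is not supported by Apple's on-device model.",
    "type": "invalid_request_error",
}})
exc = error_from_body(400, body)
print(exc)
```

With urllib, the body of a failed request is available by catching urllib.error.HTTPError and calling read() on it.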

This is the practical consequence for existing OpenAI SDKs: many SDKs send presence_penalty and frequency_penalty with default values, even when you do not set them explicitly. Anyone who bolts such an SDK onto apfel unchanged runs into 400s. The fields must either be omitted explicitly in the SDK call or set to null, where the SDK allows that.
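
Where the request body is under our own control, the unsupported fields can be filtered out before sending. This is a sketch, not apfel functionality: the set is copied from the model card above, the helper name sanitize is hypothetical, and silently dropping stop or n changes what the caller asked for, which is why the dropped names are returned for logging.

```python
# Parameters apfel's model card lists as unsupported (see /v1/models above).
UNSUPPORTED = {"logprobs", "n", "stop", "presence_penalty", "frequency_penalty"}


def sanitize(body, unsupported=UNSUPPORTED):
    """Return a copy of the request body without rejected parameters, plus what was dropped."""
    dropped = sorted(k for k in body if k in unsupported)
    clean = {k: v for k, v in body.items() if k not in unsupported}
    return clean, dropped


request = {
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Test."}],
    "temperature": 0,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}
clean, dropped = sanitize(request)
print(dropped)  # ['frequency_penalty', 'presence_penalty']
```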

The three x_ parameters (x_context_strategy, x_context_max_turns, x_context_output_reserve) go in the other direction: they extend the protocol beyond OpenAI. They map apfel’s context-management flags from chat mode onto the request body. A stock OpenAI client never sends them because it does not know them; an apfel-aware client can set them to control overflow behaviour.

The 4096-token context window is the most prominent difference. Current OpenAI production models start at 8k and go up to 128k and more. For the agent we are building, this means: manage context carefully, trim tool outputs, no long file dumps in the request body.
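
Trimming tool output can be as simple as keeping the head and the tail of a long text. The sketch below rests on a loud assumption: it estimates tokens as four characters each, which is a rough rule of thumb and not apfel's tokenizer. The helper name trim_to_budget and the marker text are hypothetical.

```python
CONTEXT_WINDOW = 4096  # from /health
CHARS_PER_TOKEN = 4    # ASSUMPTION: rough heuristic, not apfel's tokenizer


def trim_to_budget(text, max_tokens, marker="\n[... trimmed ...]\n"):
    """Keep the head and tail of text so its estimated size fits max_tokens."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    keep = max(max_chars - len(marker), 0)  # characters left for real content
    head = keep // 2
    tail = keep - head
    return text[:head] + marker + (text[-tail:] if tail else "")


tool_output = "x" * 10_000
trimmed = trim_to_budget(tool_output, max_tokens=500)
print(len(trimmed))  # 2000
```

The prompt_tokens value in the usage object of a real response is the better signal; a heuristic like this only decides how much to send before that number is known.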

Why we build against the protocol, not against FoundationModels

apfel offers two paths to the model: the CLI (Article 1) and the serve mode. For the agent we choose the serve mode as the interface, for three reasons.

Interchangeability. Every SDK that expects an OpenAI endpoint docks onto http://127.0.0.1:11434. If we want to swap the on-device model for an external one (for tests, for fallback), we change a URL, not the client code.

Testability. curl and shell scripts are the simplest conceivable clients. We can try every endpoint before the first Swift byte, isolate bugs and write smoke tests that run without Xcode. scripts/smoke-serve.sh in the demo repo is an example: three lines of Bash that start the server, query /health and make one chat-completion round-trip.

Separation. The agent logic knows no Foundation Models framework details. It sends HTTP requests and evaluates JSON. The Foundation Model is an interchangeable building block behind the protocol, not a fixed core of the architecture.

The server’s security posture

The default start is conservatively configured. Bind address 127.0.0.1 means: the server is reachable only from the local machine, not from the network. CORS is off, token auth is off.

APFEL_TOKEN protects the server with a bearer token:

APFEL_TOKEN=my-secret apfel --serve

All requests then need:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '...'

Requests without a valid token come back with HTTP 401. The /health endpoint stays publicly reachable even with a token set, which is sensible for liveness probes.
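
In client code, the bearer header is one optional entry in the request headers. The sketch below builds a request with Python's urllib without sending it; the helper name build_request is hypothetical, and reading the token from APFEL_TOKEN on the client side would be our own convention rather than something apfel prescribes.

```python
import json
import urllib.request


def build_request(base, path, token=None, payload=None):
    """Build a urllib Request; add the bearer header only when a token is given."""
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(f"{base}{path}", data=data, headers=headers, method=method)


req = build_request("http://127.0.0.1:11434", "/v1/models", token="my-secret")
print(req.get_method(), req.get_header("Authorization"))  # GET Bearer my-secret
```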

The dangerous combination is --host 0.0.0.0 --cors without APFEL_TOKEN. With that the model is reachable for everyone on the local network and browser requests are allowed. Anyone who opens this up in a corporate network exposes a model endpoint without access control. For local development on your own machine the default (127.0.0.1, no token, CORS off) is sufficient and correct.
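
A start script can check for this combination before launching the server. The following is a sketch of such a pre-flight check, not something apfel ships; the function names is_loopback and exposure_warnings are hypothetical.

```python
import ipaddress


def is_loopback(host):
    """True for localhost and loopback IP literals; anything unparsable counts as non-loopback."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def exposure_warnings(host="127.0.0.1", cors=False, token=None):
    """List the risky aspects of a planned apfel --serve configuration."""
    warnings = []
    if not is_loopback(host) and not token:
        msg = "server is reachable from the network without APFEL_TOKEN"
        if cors:
            msg += "; browser requests are allowed as well (--cors)"
        warnings.append(msg)
    return warnings


print(exposure_warnings())                       # []
print(exposure_warnings("0.0.0.0", cors=True))   # one warning
```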

Observability via the logs endpoint

With --debug apfel logs every request with its full body and an event trace:

apfel --serve --debug
curl -s http://127.0.0.1:11434/v1/logs | jq .
curl -s http://127.0.0.1:11434/v1/logs/stats | jq .

/v1/logs/stats (real, own measurement 2026-06-03):

{
  "total_requests": 5,
  "total_errors": 1,
  "avg_duration_ms": 503,
  "requests_per_minute": 8.1,
  "max_concurrent": 5,
  "active_requests": 0,
  "uptime_seconds": 37
}

For the agent in later articles, /v1/logs is the most important debugging window. When a tool-calling loop does not do what we expect, in /v1/logs we see what the agent actually sent to the model, not just what we wrote in the Swift code. That is the difference between logging on the client side and logging at the protocol level.

avg_duration_ms: 503 shows the mean request latency for short prompts (own measurement 2026-06-03 with apfel 1.5.1 on macOS 26.3, M-series Mac). The value varies with prompt length and model load; for the agent loop it is a useful reference point.
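
Two derived figures are useful when watching an agent loop: the error rate and the request rate over the server's uptime. The sketch below computes both from the payload above; the helper name summarize is hypothetical, and how apfel itself calculates requests_per_minute is an assumption that happens to match this one sample.

```python
def summarize(stats):
    """Derive error rate and observed request rate from a /v1/logs/stats payload."""
    total = stats["total_requests"]
    uptime = stats["uptime_seconds"]
    error_rate = stats["total_errors"] / total if total else 0.0
    per_minute = total / uptime * 60 if uptime else 0.0
    return {"error_rate": error_rate, "requests_per_minute": round(per_minute, 1)}


stats = {
    "total_requests": 5, "total_errors": 1, "avg_duration_ms": 503,
    "requests_per_minute": 8.1, "max_concurrent": 5,
    "active_requests": 0, "uptime_seconds": 37,
}
print(summarize(stats))  # {'error_rate': 0.2, 'requests_per_minute': 8.1}
```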

Demo repo: apfel-coding-agent v0.2

The state of this article is frozen as tag v0.2: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v0.2

Clone (if not done already) and check out the tag:

git clone https://codeberg.org/rotecodefraktion/apfel-coding-agent.git
cd apfel-coding-agent
git checkout v0.2
chmod +x scripts/*.sh

New in v0.2 compared to v0.1:

  • docs/serve-protocol.md — endpoint reference + deviation table OpenAI vs. apfel
  • scripts/curl-examples.sh — round-trips over all endpoints incl. SSE streaming and the 400 error path
  • scripts/smoke-serve.sh — starts the server, checks /health, runs one chat-completion round-trip; exit 0 = green

First test:

./scripts/smoke-serve.sh

When the last output shows SMOKE OK, everything runs. The script exits cleanly, the server is stopped afterwards.

Pitfalls from the build

Ollama port collision. apfel’s default port 11434 is the same as Ollama’s. If Ollama runs in the background, curl http://127.0.0.1:11434/ answers with the text Ollama is running instead of apfel JSON. This looks like an apfel error but is not one. Fix: stop Ollama or start apfel on a different port (--port 3001).
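
A client can detect the collision from the response body before it sends its first real request. The sketch below classifies a body fetched from the port: the Ollama text and the apfel model name come from this article, everything else is treated as unknown. The function name identify_server is hypothetical.

```python
import json


def identify_server(body):
    """Guess which server answered on the port from a response body."""
    if "Ollama is running" in body:
        return "ollama"
    try:
        data = json.loads(body)
    except ValueError:
        return "unknown"
    if isinstance(data, dict) and data.get("model") == "apple-foundationmodel":
        return "apfel"
    return "unknown"


print(identify_server("Ollama is running"))                                    # ollama
print(identify_server('{"model": "apple-foundationmodel", "status": "ok"}'))  # apfel
```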

/v1/logs needs --debug. Without the flag you get an HTTP 400 with the message “Request log stats are only available when the server is started with --debug.” This is not prominent in the base help; the error text delivers the explanation only on the first hit.

400 on unsupported parameters. Many OpenAI SDKs send presence_penalty and frequency_penalty with default values without you setting them explicitly. This leads straight to a 400 error with apfel. Anyone who bolts an OpenAI SDK on blindly and wonders about 400s should look first at the request body and not at the apfel server. The debugging pattern: rebuild the request with curl by hand and add fields one at a time until the error appears.
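
The add-one-field-at-a-time pattern can also be automated. In the sketch below, send is a hypothetical callable that posts a body and returns the HTTP status code; for the offline demonstration it is replaced by a stand-in that mimics apfel's documented 400 behaviour.

```python
def find_rejected_field(base_body, extra_fields, send):
    """Add extra fields one at a time; return the first one that makes send() report 400."""
    body = dict(base_body)
    for name, value in extra_fields.items():
        body[name] = value
        if send(body) == 400:
            return name
    return None


def fake_send(body):
    """Stand-in for a real POST: 400 when an unsupported parameter is present."""
    unsupported = {"logprobs", "n", "stop", "presence_penalty", "frequency_penalty"}
    return 400 if unsupported & set(body) else 200


base = {"model": "apple-foundationmodel", "messages": [{"role": "user", "content": "Test."}]}
extras = {"temperature": 0, "seed": 42, "frequency_penalty": 0, "stop": ["\n"]}
print(find_rejected_field(base, extras, fake_send))  # frequency_penalty
```

The function stops at the first offender, so after removing that field it is worth running the check again to catch any further ones.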

What comes next

Article 3 builds the Swift client that docks onto this server. We set up a Swift package, connect via the URLSession-based HTTP layer to /v1/chat/completions and handle SSE streaming in the client. The OpenAI protocol we played through by hand in this article is the foundation the client sits on.


Previous article: apfel from the command line. Next article: The Swift client: first connection to the model (link follows when Article 3 is published). Repo tag: v0.2.