Understanding tool calling: from schema to round-trip

Article 4 · Series: A Local Coding Agent with apfel

In Article 3 the connection was in place: prompt in, streamed answer out. That lets the model talk, but not act. The step from chat to agent is tool calling — the model may invoke tools, we run them and feed the result back. In this article we build exactly that mechanism: a tool definition in the OpenAI schema, the round-trip of invocation, execution and continuation, and a small abstraction in Swift that captures a tool as a protocol and a registry. The demo is get_time, a trivial tool with no side effects. Along the way we check how reliably the small model plays along — and run into a few quirks that cannot be derived from the OpenAI standard. The state is frozen as tag v0.4: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v0.4

What a tool definition promises the model

To the model, a tool is at first just a description: a name, a sentence of explanation and a JSON schema of its parameters. In the OpenAI format that apfel passes through to the Foundation Model, the definition of get_time looks like this:

{
  "type": "function",
  "function": {
    "name": "get_time",
    "description": "Get the current date and time in ISO-8601 format. Takes no arguments.",
    "parameters": { "type": "object", "properties": {}, "required": [] }
  }
}

That is all the model sees. It knows neither the code behind the tool nor its return value — only the promise that there is a get_time taking no arguments. On the basis of this description it decides whether and how to call the tool. In Swift we model the same structure as Codable types:

public struct ToolDefinition: Codable, Sendable, Equatable {
    public let type: String          // always "function"
    public let function: FunctionDefinition
}

public struct FunctionDefinition: Codable, Sendable, Equatable {
    public let name: String
    public let description: String
    public let parameters: JSONSchema
}

public struct JSONSchema: Codable, Sendable, Equatable {
    public let type: String
    public let properties: [String: Property]
    public let required: [String]
}

We keep JSONSchema deliberately minimal: object type, typed properties, a required list. For get_time the properties are empty. That is enough for now; richer schemas arrive when a tool needs them — the real file and shell tools in Article 5.
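
To check that these types really produce the JSON shown above, a minimal sketch can encode the get_time definition and print it. The article does not show Property, so the version below (a type plus an optional description) is an assumption, as is the sorted-keys output formatting.

```swift
import Foundation

// Hypothetical: the article references `Property` but does not show it.
struct Property: Codable, Sendable, Equatable {
    let type: String
    let description: String?
}

struct JSONSchema: Codable, Sendable, Equatable {
    let type: String
    let properties: [String: Property]
    let required: [String]
}

struct FunctionDefinition: Codable, Sendable, Equatable {
    let name: String
    let description: String
    let parameters: JSONSchema
}

struct ToolDefinition: Codable, Sendable, Equatable {
    let type: String          // always "function"
    let function: FunctionDefinition
}

let getTimeDefinition = ToolDefinition(
    type: "function",
    function: FunctionDefinition(
        name: "get_time",
        description: "Get the current date and time in ISO-8601 format. Takes no arguments.",
        parameters: JSONSchema(type: "object", properties: [:], required: [])
    )
)

let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]   // stable key order, handy for fixtures
let definitionJSON = String(decoding: try encoder.encode(getTimeDefinition), as: UTF8.self)
print(definitionJSON)
```

The empty properties dictionary encodes as {} and the empty required list as [], which matches the hand-written definition apart from key order.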

The round-trip: invocation, execution, continuation

Tool calling is not a single request but a small choreography across two calls:

1. We send the conversation plus tool definitions to /v1/chat/completions.
2. Instead of a text answer, an assistant message comes back with tool_calls and finish_reason: "tool_calls". The content is then null.
3. We run each tool call and append the result as a role: "tool" message, linked via the tool_call_id.
4. We send the extended conversation again. Now the model answers with text: the final answer.

Exactly how apfel returns the tool call is something we read off the actual response rather than derive from the OpenAI standard. This is apfel’s raw response to the get_time request:

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "content": null,
      "role": "assistant",
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_time",
          "arguments": "{\"current_time\": \"2023-10-29T15:48:30.567Z\"}"
        }
      }]
    }
  }]
}

Two details stand out. First, function.arguments is a string, not an object; more on that shortly. Second, the model invented a current_time argument even though get_time’s schema has no parameters at all. That is not a fluke but a trait of the small model, and we come back to it below.
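
Continuing from this response, the history that goes out in the second request can be sketched as follows. The Message type and its field names are assumptions modelled on the OpenAI wire format, not the repo's ChatMessage, and the tool result is a placeholder string.

```swift
import Foundation

// Hypothetical wire types, modelled on the OpenAI chat format.
struct FunctionCall: Codable { let name: String; let arguments: String }
struct ToolCall: Codable { let id: String; let type: String; let function: FunctionCall }

struct Message: Codable {
    let role: String
    var content: String? = nil        // nil is omitted from the encoded JSON
    var toolCalls: [ToolCall]? = nil
    var toolCallID: String? = nil

    enum CodingKeys: String, CodingKey {
        case role, content
        case toolCalls = "tool_calls"
        case toolCallID = "tool_call_id"
    }
}

let call = ToolCall(
    id: "call_1",
    type: "function",
    function: FunctionCall(name: "get_time",
                           arguments: #"{"current_time": "2023-10-29T15:48:30.567Z"}"#)
)

var conversation = [
    Message(role: "user", content: "What time is it right now? Use the get_time tool.")
]
// Echo the assistant's tool call back into the history.
conversation.append(Message(role: "assistant", toolCalls: [call]))
// One tool message per call, linked through tool_call_id.
conversation.append(Message(role: "tool",
                            content: #"{"time":"<result of get_time>"}"#,
                            toolCallID: call.id))

let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
print(String(decoding: try encoder.encode(conversation), as: UTF8.self))
```

The link between the second and third message is the id: the tool message carries the same call_1 that the assistant message announced.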

The tool as a protocol

So that the round-trip does not need bespoke code per tool, we capture a tool as a protocol. The signature is deliberately aligned with the wire format, meaning what actually goes over HTTP, not with Swift convenience:

public protocol Tool: Sendable {
    var name: String { get }
    var description: String { get }
    var parametersSchema: JSONSchema { get }

    func call(_ arguments: Data) async throws -> String
}

Data in, because the model delivers the arguments as a JSON string — each tool decodes them itself and is therefore also responsible for catching broken arguments. String out, because the result goes back as the content of the role: tool message. We derive the tool definition from the tool itself, so schema and implementation cannot drift apart:

extension Tool {
    public var definition: ToolDefinition {
        ToolDefinition(type: "function", function: FunctionDefinition(
            name: name, description: description, parameters: parametersSchema
        ))
    }
}
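
What "each tool decodes them itself" means in practice shows best on a tool that does take arguments. The sketch below uses a hypothetical echo tool that is not part of the repo; the Property type is likewise an assumption, since the article does not show it.

```swift
import Foundation

// Hypothetical: the article references `Property` but does not show it.
struct Property: Codable, Sendable, Equatable { let type: String }

struct JSONSchema: Codable, Sendable, Equatable {
    let type: String
    let properties: [String: Property]
    let required: [String]
}

protocol Tool: Sendable {
    var name: String { get }
    var description: String { get }
    var parametersSchema: JSONSchema { get }
    func call(_ arguments: Data) async throws -> String
}

// Hypothetical demo tool, not part of the repo.
struct EchoTool: Tool {
    struct Arguments: Decodable { let text: String }

    let name = "echo"
    let description = "Return the given text unchanged."
    let parametersSchema = JSONSchema(
        type: "object",
        properties: ["text": Property(type: "string")],
        required: ["text"]
    )

    // Synchronous decoding step: throws DecodingError on missing or mistyped fields.
    func decode(_ arguments: Data) throws -> Arguments {
        try JSONDecoder().decode(Arguments.self, from: arguments)
    }

    func call(_ arguments: Data) async throws -> String {
        try decode(arguments).text
    }
}

let echo = EchoTool()
let good = try echo.decode(Data(#"{"text": "hello"}"#.utf8))
print(good.text)                                   // hello
let broken = try? echo.decode(Data(#"{"txt": "hello"}"#.utf8))
print(broken == nil)                               // true: the required key is missing
```

Splitting decode out of call is a choice made for this sketch only; it keeps the rejection path checkable without an async context.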

The registry: looking tools up and offering them

The registry has exactly two jobs, one per direction of the round-trip. On the way out it collects all definitions for the request; on the way back it looks up the tool the model named and runs it.

public struct ToolRegistry: Sendable {
    private var tools: [String: any Tool]

    public var definitions: [ToolDefinition] {
        tools.values.map(\.definition)
    }

    public func dispatch(name: String, arguments: Data) async throws -> String {
        guard let tool = tools[name] else {
            throw ToolError.unknownTool(name)
        }
        return try await tool.call(arguments)
    }
}

An unknown name becomes a typed ToolError, not a crash. That is the first line of defence against hallucinated tool names — and it is needed as soon as the model makes up a name it was never offered.
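
The unknown-name path can be exercised in isolation. The following is a reduced sketch, not the repo's code: the Tool protocol is cut down to what the lookup needs, FixedTimeTool is a hypothetical stand-in for get_time, and the separate lookup method is an assumption added to make the failure easy to check.

```swift
import Foundation

enum ToolError: Error, Equatable {
    case unknownTool(String)
}

// Reduced to what the lookup needs; the article's protocol also carries a schema.
protocol Tool: Sendable {
    var name: String { get }
    func call(_ arguments: Data) async throws -> String
}

// Hypothetical stand-in for get_time with a fixed answer.
struct FixedTimeTool: Tool {
    let name = "get_time"
    func call(_ arguments: Data) async throws -> String {
        #"{"time":"2026-06-06T21:12:00Z"}"#
    }
}

struct ToolRegistry: Sendable {
    private var tools: [String: any Tool] = [:]

    init(_ tools: [any Tool]) {
        for tool in tools { self.tools[tool.name] = tool }
    }

    // Synchronous lookup, split out so the failure path is easy to test.
    func lookup(_ name: String) throws -> any Tool {
        guard let tool = tools[name] else { throw ToolError.unknownTool(name) }
        return tool
    }

    func dispatch(name: String, arguments: Data) async throws -> String {
        try await lookup(name).call(arguments)
    }
}

let registry = ToolRegistry([FixedTimeTool()])

var lookupError: ToolError?
do {
    _ = try registry.lookup("get_weather")     // a name the model made up
} catch let error as ToolError {
    lookupError = error
} catch {
    print("unexpected error: \(error)")
}
print(lookupError == .unknownTool("get_weather"))   // true
```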

arguments is a string, not an object

The obvious assumption is that a tool call is a function invocation with ready-made object arguments. In fact, function.arguments arrives as a JSON string in the response body. That string must be parsed first, and on the small model it is not guaranteed to be valid against the schema. We saw above that get_time, which takes no parameters, came back with a current_time argument. Treat the string as an object unchecked, or pass it straight through, and you get crashes or silently wrong calls.

So arguments stays a raw string in our ToolCall type, and decoding is the tool’s job — the tool may reject the arguments or, like get_time, simply ignore them:

public struct ToolCall: Codable, Sendable, Equatable {
    public let id: String
    public let type: String
    public let function: FunctionCall
    public let index: Int?           // set only when streamed

    public struct FunctionCall: Codable, Sendable, Equatable {
        public let name: String
        public let arguments: String  // raw JSON string
    }
}
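
Decoding therefore happens in two steps: first the envelope, then the string inside it. A minimal sketch using the tool call from the raw response above; unexpectedKeys is a hypothetical helper for illustration, not something the repo is known to contain.

```swift
import Foundation

struct ToolCall: Codable, Sendable, Equatable {
    let id: String
    let type: String
    let function: FunctionCall
    let index: Int?           // set only when streamed

    struct FunctionCall: Codable, Sendable, Equatable {
        let name: String
        let arguments: String  // raw JSON string
    }
}

// The tool call from the raw response, as it appears on the wire.
let wire = #"""
{
  "id": "call_1",
  "type": "function",
  "function": {
    "name": "get_time",
    "arguments": "{\"current_time\": \"2023-10-29T15:48:30.567Z\"}"
  }
}
"""#

// Step 1: decode the envelope. `arguments` is still an opaque string here.
let call = try JSONDecoder().decode(ToolCall.self, from: Data(wire.utf8))

// Step 2: parse the string as JSON in its own right and compare it with the schema.
// Hypothetical helper: returns the argument names the schema does not declare.
func unexpectedKeys(in arguments: String, allowed: Set<String>) throws -> [String] {
    let parsed = try JSONSerialization.jsonObject(with: Data(arguments.utf8))
    guard let object = parsed as? [String: Any] else {
        throw DecodingError.dataCorrupted(
            .init(codingPath: [], debugDescription: "arguments is not a JSON object"))
    }
    return object.keys.filter { !allowed.contains($0) }.sorted()
}

print(try unexpectedKeys(in: call.function.arguments, allowed: []))   // ["current_time"]
```

Whether a tool rejects such extra keys or ignores them is its own decision; get_time ignores them.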

The demo tool get_time

get_time is deliberately trivial: no parameters, no side effects, a predictable result. It shows the round-trip without the safety machinery that writing tools will need. We inject the clock so the tool stays testable:

public struct GetTimeTool: Tool {
    public let name = "get_time"
    public let description = "Get the current date and time in ISO-8601 format. Takes no arguments."
    // Empty object schema: no properties, nothing required.
    public let parametersSchema = JSONSchema(type: "object", properties: [:], required: [])

    private let now: @Sendable () -> Date

    // The clock is injected: tests pass a fixed date, production uses the default.
    public init(now: @escaping @Sendable () -> Date = { Date() }) {
        self.now = now
    }

    public func call(_ arguments: Data) async throws -> String {
        // The model sometimes sends arguments even though the schema is empty.
        // We ignore them: get_time takes none.
        let payload = ["time": ISO8601DateFormatter().string(from: now())]
        return String(decoding: try JSONEncoder().encode(payload), as: UTF8.self)
    }
}

The round-trip itself lives in its own type. It performs exactly one pass — not a loop. If the model calls a tool again after the results, that second round is not executed here; the full plan/act/observe loop is Article 7.

public func run(_ messages: [ChatMessage]) async throws -> Result {
    var conversation = messages
    let first = try await complete(conversation, toolChoice)

    guard let calls = first.choices.first?.message.toolCalls, !calls.isEmpty else {
        // The model answered directly; no tool needed.
        return Result(finalContent: first.choices.first?.message.content, toolCalls: [])
    }

    conversation.append(ChatMessage(assistantToolCalls: calls))
    for call in calls {
        conversation.append(ChatMessage(toolCallID: call.id, content: await result(for: call)))
    }

    let final = try await complete(conversation, nil)
    return Result(finalContent: final.choices.first?.message.content, toolCalls: calls)
}

A failing or unknown tool comes back as a result, not a thrown error — so the model gets a chance to recover instead of the round-trip aborting:

private func result(for call: ToolCall) async -> String {
    do {
        return try await registry.dispatch(name: call.function.name,
                                            arguments: Data(call.function.arguments.utf8))
    } catch {
        let payload = ["error": String(describing: error)]
        return (try? String(decoding: JSONEncoder().encode(payload), as: UTF8.self))
            ?? #"{"error":"tool failed"}"#
    }
}
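
What the model actually receives in the failure case is easy to inspect. The sketch below lifts the catch branch into a hypothetical free function, errorPayload, and feeds it an unknown-tool error; the ToolError enum is reduced to the one case the article names.

```swift
import Foundation

enum ToolError: Error { case unknownTool(String) }

// Hypothetical helper mirroring the catch branch above: any error becomes a JSON result.
func errorPayload(for error: Error) -> String {
    let payload = ["error": String(describing: error)]
    guard let data = try? JSONEncoder().encode(payload) else {
        return #"{"error":"tool failed"}"#      // last-resort fallback
    }
    return String(decoding: data, as: UTF8.self)
}

print(errorPayload(for: ToolError.unknownTool("get_weather")))
// {"error":"unknownTool(\"get_weather\")"}
```

The text is whatever String(describing:) yields for the error, so a more readable message for the model would need an error type with its own description.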

A round-trip against the real model

The CLI gets a --tools path that stocks the registry with get_time and triggers the round-trip. The tool calls go to stderr so that stdout carries only the final answer:

$ swift run apfel-agent --tools "What time is it right now? Use the get_time tool."
→ tool call: get_time({"current_time": "2025-02-02T14:34:56.789Z"})
The current time is June 6, 2026, at 9:12 PM UTC.

The whole arc is visible here: the model calls get_time (with an invented argument that get_time ignores), gets the real time back as the result and formulates an answer from it in natural language. The round-trip stands.

Why an abstraction instead of a special case

The rationale for protocol, registry and derived definition is recorded in docs/adr/002-tool-abstraktion.md in the repo. The core: the definition is generated from the tool, not maintained as a second, separate record, so schema and code cannot drift. Schema encoding, tool-call decoding and dispatch are tested offline against recorded fixtures and a scripted fake backend, with no running apfel. And because arguments are treated as an untrusted string and tool failures are returned as results, the round-trip holds even when the model works sloppily. What “works sloppily” means in practice is something we determined empirically.

Where the small model wobbles

Tool calling works — but unreliably, and in a way we only found out by measuring. We ran get_time against apfel 1.5.1 repeatedly and counted how often the model actually makes the tool call.

Variant                                      Tool-call rate (calls/runs)
directive prompt, no tool_choice             12/15
directive prompt, tool_choice: "auto"        6/15
neutral prompt, no tool_choice               2/10
directive prompt, tool_choice: "required"    1/15

Source: own data, apfel 1.5.1, 2026-06-06, scripts/tool-choice-experiment.sh.
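
Because the variants were not all run the same number of times, percentages make them easier to compare. A small sketch that uses only the counts from the table; with 10 to 15 runs per variant the results are rough indications rather than precise rates.

```swift
import Foundation

// Counts from the table above: (variant, tool calls, runs).
let runs: [(variant: String, calls: Int, total: Int)] = [
    ("directive prompt, no tool_choice", 12, 15),
    ("directive prompt, tool_choice: \"auto\"", 6, 15),
    ("neutral prompt, no tool_choice", 2, 10),
    ("directive prompt, tool_choice: \"required\"", 1, 15),
]

func percent(_ calls: Int, of total: Int) -> Double {
    100 * Double(calls) / Double(total)
}

for run in runs {
    print(String(format: "%5.1f %%  ", percent(run.calls, of: run.total)) + run.variant)
}
```

That gives 80, 40, 20 and roughly 6.7 percent, in table order.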

Three findings stand out. First: even when we name the tool explicitly, the model calls it in only about four of five runs — without explicit naming, far less often. Second, and counterintuitively: tool_choice: "required", which in the OpenAI standard forces a tool call, does the opposite on apfel’s Foundation Model. The model refuses:

$ # request with tool_choice: "required"
"content": "I'm sorry, but I can't assist with that.", "finish_reason": "stop"

Third, omitting tool_choice clearly beats the explicit value "auto" (12/15 against 6/15), even though both nominally mean the same thing. From these findings follows a concrete decision: our agent leaves tool_choice off. The mechanism stays in the code for later articles, but the demo forces nothing.
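
Leaving tool_choice off means the key is absent from the request body, not sent as null or "auto". With Swift's synthesized Encodable conformance an optional property that is nil is skipped, which the sketch below demonstrates. ChatRequest and the model name are placeholders for illustration; the repo's ChatModels.swift may be shaped differently.

```swift
import Foundation

// Hypothetical request type, reduced to the fields this example needs.
struct ChatRequest: Encodable {
    let model: String
    let messages: [[String: String]]
    var toolChoice: String? = nil      // nil means: the key is left out of the JSON

    enum CodingKeys: String, CodingKey {
        case model, messages
        case toolChoice = "tool_choice"
    }
}

func body(_ request: ChatRequest) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(request), as: UTF8.self)
}

let messages = [["role": "user", "content": "What time is it right now? Use the get_time tool."]]

let omitted = try body(ChatRequest(model: "placeholder-model", messages: messages))
let explicit = try body(ChatRequest(model: "placeholder-model", messages: messages,
                                    toolChoice: "auto"))

print(omitted.contains("tool_choice"))    // false
print(explicit.contains("tool_choice"))   // true
```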

There is a third failure type beyond the missing call and the hallucinated argument. In one run the model called get_time correctly, got the real time back — and still answered “It seems there was an error retrieving the current time.” The tool ran without fault; the model misread its result. This is the kind of behaviour that separates a local 3-billion-parameter agent from a cloud model — and the reason Article 6 evaluates the performance limit systematically rather than glossing over it.

Demo repo: apfel-coding-agent v0.4

The state of this article is frozen as tag v0.4: https://codeberg.org/rotecodefraktion/apfel-coding-agent/src/tag/v0.4

Clone (if you haven’t already) and check out the tag:

git clone https://codeberg.org/rotecodefraktion/apfel-coding-agent.git
cd apfel-coding-agent
git checkout v0.4

New in v0.4 over v0.3:

Sources/AgentCore/Tools/ — Tool protocol, ToolRegistry, schema types, GetTimeTool, ToolRoundTrip
Sources/AgentCore/Client/ChatModels.swift — extended with tools, tool_calls, role: tool and tool_choice
Sources/apfel-agent/AgentCommand.swift — new --tools path
docs/adr/002-tool-abstraktion.md — the tool abstraction
scripts/smoke-tool.sh — end-to-end test of the round-trip
scripts/tool-choice-experiment.sh — reproduces the tool-call rate

Build, test, run:

swift build
swift test                        # offline, no apfel needed
swift run apfel-agent --tools "What time is it? Use the get_time tool."

The unit tests run without apfel. The end-to-end test and the experiment need a running apfel serve:

./scripts/smoke-tool.sh
./scripts/tool-choice-experiment.sh

The tool call is unreliable, so the smoke test retries and turns green as soon as one run completes the round-trip. Prerequisites are in docs/setup.md.
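
The retry idea itself is small. The repo's smoke test is a shell script; the Swift sketch below only illustrates the logic, with a hypothetical firstSuccess helper and a scripted sequence of outcomes standing in for real round-trips.

```swift
// Hypothetical helper: run `attempt` up to `maxAttempts` times and stop at the first success.
// Returns the number of the attempt that succeeded, or nil if none did.
func firstSuccess(maxAttempts: Int, _ attempt: (Int) -> Bool) -> Int? {
    var number = 1
    while number <= maxAttempts {
        if attempt(number) { return number }
        number += 1
    }
    return nil
}

// Scripted stand-in for the real round-trip: fails twice, then succeeds.
let scripted = [false, false, true, true]
let passedOn = firstSuccess(maxAttempts: 5) { number in
    scripted.indices.contains(number - 1) && scripted[number - 1]
}
print(passedOn as Any)   // Optional(3)
```

A test built this way shows that the round-trip can succeed, not how often it does; the rate is what the experiment script measures.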

Pitfalls from the build

arguments is a string, not an object. This is the most common misconception. Model it as a nested object in your Codable type, for example as [String: String], and decoding fails with a type mismatch. The value is a JSON string inside the JSON, and it must be treated, decoded and validated as such.

Consume tool calls from the non-streaming response. In the stream apfel delivers the tool call twice: first as raw delta.content text fragments, then as a single bundled delta.tool_calls chunk at the end. Collect only delta.content in the stream and you mistake the raw tool-call JSON for the answer. For the round-trip we take the non-streaming response; it is simpler and unambiguous.

tool_choice: "required" forces nothing. Contrary to what the OpenAI standard suggests, the Foundation Model refuses under required instead of calling a tool. Try to force reliability through this parameter and you get the opposite.

Omitting tool_choice beats "auto". Setting the default explicitly measurably worsens the rate. When in doubt, leave the parameter off entirely.
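
For anyone who does consume the stream, the double delivery described in the streaming pitfall suggests one rule: once a chunk carries tool_calls, the content collected so far is not an answer. The sketch below is a simplified illustration under assumptions; the Delta type, its fields and the content fragments are invented for the example and do not reproduce apfel's exact chunks.

```swift
// Hypothetical, simplified view of one streamed chunk's delta.
struct Delta {
    var content: String? = nil
    var toolCallNames: [String]? = nil   // stands in for delta.tool_calls
}

enum StreamOutcome: Equatable {
    case answer(String)
    case toolCalls([String])
}

// Accumulate content, but let any tool_calls chunk override it:
// the text collected so far was only the raw tool-call JSON.
func collect(_ deltas: [Delta]) -> StreamOutcome {
    var text = ""
    var calls: [String] = []
    for delta in deltas {
        text += delta.content ?? ""
        calls += delta.toolCallNames ?? []
    }
    return calls.isEmpty ? .answer(text) : .toolCalls(calls)
}

let toolStream = [
    Delta(content: #"{"name": "get_"#),
    Delta(content: #"time"}"#),
    Delta(toolCallNames: ["get_time"]),   // bundled chunk at the end
]
print(collect(toolStream))                          // toolCalls(["get_time"])
print(collect([Delta(content: "It is "), Delta(content: "9 PM.")]))
```

The catch is that the decision can only be made once the stream has ended, which is one more reason the round-trip uses the non-streaming response.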

What comes next

get_time is harmless: no arguments, no side effects. Article 5 takes on the first real tools — read_file, list_dir, write_file, run_shell. With them the agent changes something outside itself for the first time, and that is exactly where we need what get_time does not: a path sandbox, confirmation gates before writing actions, and showing diffs before the human decides. The tool abstraction from this article carries all of it — the new tools are just more Tool implementations in the same registry.

Previous article: The Swift client: first connection to the model. Next article: The first real tools: file system and shell. Repo tag: v0.4.