Real Inference — MLXClient and the Local Model

Article 2 implemented the protocol handshake: two endpoints, two formats, mock responses. Now a real model goes on the other end. MLXClient connects our Hummingbird gateway to mlx_lm.server, a Python-based OpenAI-compatible inference server that loads MLX models on Apple Silicon. After this article, the gateway delivers real answers.

Installing and starting mlx_lm.server

One-time setup. Requires Apple Silicon (M-series chip), macOS 26, Python 3.11+.

1. Installation

uv tool install mlx-lm

2. Start the server

mlx_lm.server --model mlx-community/Qwen3-8B-4bit --port 8081

The first start downloads the model (~5 GB). After that, the server starts immediately. The default port for mlx_lm.server is 8080; we use 8081 so our gateway can continue running on 8080.

3. Verify

curl http://localhost:8081/v1/models

Returns a list of loaded models. If that works, the server is ready.

Model recommendations

Model                                         RAM     Notes
mlx-community/Llama-3.2-3B-Instruct-4bit      ~3 GB   Fast, good for 8 GB Macs
mlx-community/Qwen3-8B-4bit                   ~6 GB   Sweet spot for coding tasks
mlx-community/Mistral-7B-Instruct-v0.3-4bit   ~5 GB   Strong in English, follows instructions well
mlx-community/Qwen3-14B-4bit                  ~10 GB  M3 Max+, noticeably higher quality

Linux: Ollama as backend

Alternative for Linux servers: mlx_lm.server runs on Apple Silicon only.

On Linux, use Ollama instead. It exposes the same OpenAI-compatible API, and the same gateway works without any changes.

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

2. Pull a model

ollama pull qwen3:8b

3. Start the gateway (pointing to Ollama)

swift run gateway \
  --mlx-url http://localhost:11434 \
  --mlx-model qwen3:8b

Ollama runs on port 11434 by default and responds at /v1/chat/completions. The gateway does not notice the difference.
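One way to see why the switch is transparent: the only backend-specific input is the base URL, and the chat-completions endpoint is derived from it. The following is a minimal sketch, not the gateway's actual code; the helper name chatCompletionsURL is hypothetical, and the trailing-slash handling mirrors what the MLXClient initialiser does later in this article.

```swift
import Foundation

/// Hypothetical helper mirroring MLXClient: strip one trailing slash, append the path.
func chatCompletionsURL(baseURL: String) -> URL? {
    let trimmed = baseURL.hasSuffix("/") ? String(baseURL.dropLast()) : baseURL
    return URL(string: "\(trimmed)/v1/chat/completions")
}

// Same function, two backends: only the host and port differ.
let mlx = chatCompletionsURL(baseURL: "http://localhost:8081")
let ollama = chatCompletionsURL(baseURL: "http://localhost:11434/")
print(mlx?.absoluteString ?? "invalid", ollama?.absoluteString ?? "invalid")
```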

Two Servers, One Data Flow

From this article on, two local processes run:

  • mlx_lm.server on port 8081: loads the model, manages context, runs inference
  • swift-mlx-gateway on port 8080: accepts client requests, translates formats, routes to the backend

mlx_lm.server is OpenAI-compatible: it accepts POST /v1/chat/completions in OpenAI format. That is the glue. Our gateway speaks both protocols externally (Anthropic and OpenAI), but internally always speaks OpenAI. The conversion happens in the gateway before the request reaches mlx_lm.server.

Claude Code  →  POST /v1/messages (Anthropic)         →  Gateway
Cursor       →  POST /v1/chat/completions (OpenAI)    →  Gateway
                                                              ↓
                            POST /v1/chat/completions (OpenAI internal)
                                                              ↓
                                                  mlx_lm.server :8081

mlx_lm.server interprets the model field in the request as a Hugging Face repo ID, so the value we pass via --mlx-model must match exactly what mlx_lm.server loaded at startup. The gateway also reports this value unchanged under /v1/models so clients know which model is available.

MLXClient

MLXClient is a Swift actor. That is more than a style choice: route handlers run on different tasks and share a single client, so the client must be safe to use concurrently. An actor guarantees that, and access to its internal state (URLSession, logger) is serialised without us writing locks ourselves.

//  Sources/gateway/MLX/MLXClient.swift

import Foundation
import Logging

enum MLXError: Error, LocalizedError {
    case invalidURL(String)
    case backendUnavailable(String)
    case inferenceError(Int, String)
    case decodingError(String)

    var errorDescription: String? {
        switch self {
        case .invalidURL(let url):
            return "Invalid MLX backend URL: \(url)"
        case .backendUnavailable(let reason):
            return "MLX backend unavailable: \(reason)"
        case .inferenceError(let status, let body):
            return "MLX backend returned HTTP \(status): \(body)"
        case .decodingError(let detail):
            return "Failed to decode MLX response: \(detail)"
        }
    }
}

actor MLXClient {
    private let baseURL: String
    private let session: URLSession
    private let logger: Logger

    init(baseURL: String, logger: Logger) {
        self.baseURL = baseURL.hasSuffix("/") ? String(baseURL.dropLast()) : baseURL
        self.logger = logger

        let config = URLSessionConfiguration.default
        config.timeoutIntervalForRequest = 120
        config.timeoutIntervalForResource = 300
        self.session = URLSession(configuration: config)
    }

    func complete(
        messages: [ChatMessage],
        model: String,
        maxTokens: Int = 1024,
        temperature: Double = 1.0
    ) async throws -> (text: String, inputTokens: Int, outputTokens: Int) {
        guard let url = URL(string: "\(baseURL)/v1/chat/completions") else {
            throw MLXError.invalidURL(baseURL)
        }

        let body = ChatCompletionRequest(
            model: model,
            messages: messages,
            maxTokens: maxTokens,
            temperature: temperature,
            topP: nil,
            stream: false,
            stop: nil,
            presencePenalty: nil,
            frequencyPenalty: nil,
            user: nil
        )

        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(body)

        logger.debug("MLX request", metadata: [
            "model": .string(model),
            "messages": .string("\(messages.count)"),
        ])

        let (data, response): (Data, URLResponse)
        do {
            (data, response) = try await session.data(for: request)
        } catch let urlError as URLError {
            throw MLXError.backendUnavailable(urlError.localizedDescription)
        }

        guard let http = response as? HTTPURLResponse else {
            throw MLXError.backendUnavailable("Non-HTTP response")
        }
        guard http.statusCode == 200 else {
            let errorBody = String(data: data, encoding: .utf8) ?? "(no body)"
            throw MLXError.inferenceError(http.statusCode, errorBody)
        }

        let decoded: ChatCompletionResponse
        do {
            decoded = try JSONDecoder().decode(ChatCompletionResponse.self, from: data)
        } catch {
            throw MLXError.decodingError(error.localizedDescription)
        }

        let text = decoded.choices.first?.message.content ?? ""
        let inputTokens = decoded.usage.promptTokens
        let outputTokens = decoded.usage.completionTokens

        logger.debug("MLX response", metadata: [
            "input_tokens": .string("\(inputTokens)"),
            "output_tokens": .string("\(outputTokens)"),
        ])

        return (text: text, inputTokens: inputTokens, outputTokens: outputTokens)
    }
}

Three details are worth noting.

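The timeouts: timeoutIntervalForRequest = 120 s limits how long a request may wait for new data; the timer resets whenever bytes arrive. Because a non-streaming completion sends nothing until generation has finished, this is effectively the maximum inference time. Local inference on an 8B model takes between 2 and 60 seconds depending on the query, occasionally longer, so 120 seconds is a practical limit. timeoutIntervalForResource = 300 s is the hard ceiling for the entire transfer.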
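To look at the two timeouts in isolation, here is a small sketch that builds the same configuration outside the actor. The helper name makeBackendSessionConfiguration and its parameter names are hypothetical; the two properties are standard URLSessionConfiguration API.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives in this module on Linux
#endif

/// Hypothetical helper: builds the session configuration used for backend calls.
func makeBackendSessionConfiguration(
    requestTimeout: TimeInterval = 120,
    resourceTimeout: TimeInterval = 300
) -> URLSessionConfiguration {
    let config = URLSessionConfiguration.default
    // Idle timeout: maximum wait for new data; the timer resets whenever bytes arrive.
    config.timeoutIntervalForRequest = requestTimeout
    // Overall limit for the whole transfer, no matter how steadily data flows.
    config.timeoutIntervalForResource = resourceTimeout
    return config
}

let config = makeBackendSessionConfiguration()
print(config.timeoutIntervalForRequest, config.timeoutIntervalForResource)
```

For larger models, raising requestTimeout is probably the first knob to turn, since a non-streaming response stays silent for the whole generation.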

The return tuple (text, inputTokens, outputTokens) carries real token counts from the MLX tokeniser. These are no longer heuristics; they are the actual values from the model response, passed through to the client unchanged.
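The following sketch shows where those counts come from in an OpenAI-style response body. The types SketchUsage and SketchCompletion are simplified, hypothetical stand-ins for the Article 2 models, and the sample payload is invented for illustration; only the JSON field names prompt_tokens and completion_tokens are taken from the OpenAI chat-completions format.

```swift
import Foundation

// Simplified, hypothetical stand-ins for the Article 2 response types.
struct SketchUsage: Decodable {
    let promptTokens: Int
    let completionTokens: Int

    // Map the snake_case JSON keys to Swift property names.
    enum CodingKeys: String, CodingKey {
        case promptTokens = "prompt_tokens"
        case completionTokens = "completion_tokens"
    }
}

struct SketchMessage: Decodable { let role: String; let content: String }
struct SketchChoice: Decodable { let message: SketchMessage }
struct SketchCompletion: Decodable {
    let choices: [SketchChoice]
    let usage: SketchUsage
}

// Invented sample payload in OpenAI chat-completions shape.
let sampleJSON = """
{
  "choices": [{"message": {"role": "assistant", "content": "Hello"}}],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
"""

let completion = try JSONDecoder().decode(SketchCompletion.self, from: Data(sampleJSON.utf8))
// Same extraction as in MLXClient.complete: first choice, then the usage block.
let text = completion.choices.first?.message.content ?? ""
print(text, completion.usage.promptTokens, completion.usage.completionTokens)
```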

stream: false is set explicitly. mlx_lm.server defaults to non-streaming, but we make our intent clear. Streaming arrives in Article 4.

Format Conversion: Anthropic to OpenAI

mlx_lm.server speaks only OpenAI. Anthropic requests must be converted before reaching the backend. The function toOpenAIMessages(from:) does exactly that:

private func toOpenAIMessages(from request: MessageRequest) -> [ChatMessage] {
    var result: [ChatMessage] = []
    if let system = request.system {
        result.append(ChatMessage(role: .system, content: system.asText))
    }
    result.append(contentsOf: request.messages.map {
        ChatMessage(role: $0.role == .user ? .user : .assistant, content: $0.content.asText)
    })
    return result
}

The Anthropic system prompt is a dedicated top-level field. OpenAI has no such field; the system prompt comes as the first message with role: "system". We insert it at the start of the array.
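A self-contained illustration of the resulting order, using simplified stand-in types with plain string content. All type names here (SketchRole, SketchChatMessage, SketchAnthropicRequest) are hypothetical and exist only so the example compiles on its own.

```swift
// Hypothetical, simplified stand-ins: content is a plain String here.
enum SketchRole: String { case system, user, assistant }

struct SketchChatMessage: Equatable {
    let role: SketchRole
    let content: String
}

struct SketchAnthropicRequest {
    let system: String?                // dedicated top-level field
    let messages: [SketchChatMessage]  // user and assistant turns only
}

func toOpenAI(_ request: SketchAnthropicRequest) -> [SketchChatMessage] {
    var result: [SketchChatMessage] = []
    // The system prompt becomes the first message of the array.
    if let system = request.system {
        result.append(SketchChatMessage(role: .system, content: system))
    }
    result.append(contentsOf: request.messages)
    return result
}

let converted = toOpenAI(SketchAnthropicRequest(
    system: "You are terse.",
    messages: [SketchChatMessage(role: .user, content: "Hi")]
))
print(converted.map { "\($0.role.rawValue): \($0.content)" })
```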

AnthropicContent.asText normalises the content value to a string, regardless of whether it arrived as a plain string or as a block array. That is the moment where the structural difference between the protocols from Article 2 becomes load-bearing: whatever the client sends must be flattened for the backend.
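For readers who do not have Article 2 at hand, this is roughly how such a normalisation can look. It is a sketch under assumptions, not the series' actual AnthropicContent: the type names SketchContent and SketchTurn are hypothetical, and joining text blocks with a newline while skipping non-text blocks is one plausible choice among several.

```swift
import Foundation

// Hypothetical sketch: content is either a plain string or an array of blocks.
enum SketchContent: Decodable {
    case text(String)
    case blocks([Block])

    struct Block: Decodable {
        let type: String
        let text: String?  // only present on "text" blocks
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        // Try the plain-string form first, then fall back to the block array.
        if let string = try? container.decode(String.self) {
            self = .text(string)
        } else {
            self = .blocks(try container.decode([Block].self))
        }
    }

    /// Flattens both forms to one string (assumption: newline-joined, text blocks only).
    var asText: String {
        switch self {
        case .text(let string):
            return string
        case .blocks(let blocks):
            return blocks.compactMap { $0.type == "text" ? $0.text : nil }
                .joined(separator: "\n")
        }
    }
}

struct SketchTurn: Decodable { let content: SketchContent }

let decoder = JSONDecoder()
let plain = try decoder.decode(SketchTurn.self, from: Data(#"{"content":"Hello"}"#.utf8))
let blocks = try decoder.decode(
    SketchTurn.self,
    from: Data(#"{"content":[{"type":"text","text":"Hello"},{"type":"text","text":"World"}]}"#.utf8)
)
print(plain.content.asText, blocks.content.asText)
```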

OpenAI requests need no conversion; payload.messages goes directly into mlxClient.complete().

Error Mapping with withMLXError

MLXError has four cases. So that they reach the client as meaningful HTTP responses, withMLXError translates them into Hummingbird HTTPError values:

private func withMLXError<T>(_ operation: () async throws -> T) async throws -> T {
    do {
        return try await operation()
    } catch let error as MLXError {
        switch error {
        case .backendUnavailable:
            throw HTTPError(.serviceUnavailable, message: error.localizedDescription)
        case .inferenceError:
            throw HTTPError(.badGateway, message: error.localizedDescription)
        case .invalidURL, .decodingError:
            throw HTTPError(.internalServerError, message: error.localizedDescription)
        }
    }
}

503 Service Unavailable: mlx_lm.server is unreachable. Connection refused, timeout, DNS failure — all land here. The client knows: the backend is temporarily unavailable, retrying makes sense.

502 Bad Gateway: mlx_lm.server is reachable but responds with an error status code. This can be a model-internal error or a request the model rejects.

500 Internal Server Error: a fault on our side. Invalid URL configuration or a response we could not decode.

Non-MLXError exceptions (e.g. Hummingbird’s own decode failures) are not caught by withMLXError and propagate normally.
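The mapping can also be read as a pure function from error case to status code, which makes it easy to check without starting Hummingbird. A minimal sketch: SketchMLXError is a local copy of the four cases so the example compiles on its own, and the helper name httpStatus(for:) is hypothetical.

```swift
// Local copy of the error cases from MLXClient.swift so the sketch is self-contained.
enum SketchMLXError: Error {
    case invalidURL(String)
    case backendUnavailable(String)
    case inferenceError(Int, String)
    case decodingError(String)
}

/// Hypothetical helper: the numeric status that corresponds to each error case.
func httpStatus(for error: SketchMLXError) -> Int {
    switch error {
    case .backendUnavailable:
        return 503  // backend unreachable, a retry may help
    case .inferenceError:
        return 502  // backend answered, but with an error status
    case .invalidURL, .decodingError:
        return 500  // misconfiguration or undecodable response on our side
    }
}

let statuses = [
    httpStatus(for: .backendUnavailable("Connection refused")),
    httpStatus(for: .inferenceError(404, "model not found")),
    httpStatus(for: .decodingError("missing field")),
]
print(statuses)
```

Because the switch has no default branch, adding a fifth MLXError case later would fail to compile until a status is chosen for it.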

Router and Application

buildRouter now takes mlxClient and modelID as parameters. The route handlers are correspondingly lean:

// Application+build.swift
protocol AppArguments {
    var host: String { get }
    var port: Int { get }
    var logLevel: Logger.Level { get }
    var mlxURL: String { get }
    var mlxModel: String { get }
}

func buildApplication(_ args: some AppArguments) async throws -> some ApplicationProtocol {
    var logger = Logger(label: "swift-mlx-gateway")
    logger.logLevel = args.logLevel

    let mlxClient = MLXClient(baseURL: args.mlxURL, logger: logger)
    let router = buildRouter(mlxClient: mlxClient, modelID: args.mlxModel)
    // ...
}

// Anthropic Messages API route (from Router+build.swift)
router.post("v1/messages") { request, context -> MessageResponse in
    let payload = try await request.decode(as: MessageRequest.self, context: context)
    try validate(payload)
    let messages = toOpenAIMessages(from: payload)
    let result = try await withMLXError {
        try await mlxClient.complete(
            messages: messages,
            model: modelID,
            maxTokens: payload.maxTokens,
            temperature: payload.temperature ?? 1.0
        )
    }
    return buildAnthropicResponse(
        text: result.text,
        model: modelID,
        inputTokens: result.inputTokens,
        outputTokens: result.outputTokens
    )
}

Five steps: decode, validate, convert, complete, build response. The complexity lives in the extracted helpers, not in the route body.

The two new CLI arguments in App.swift:

@Option(help: "Base URL of the local MLX inference server (mlx_lm.server)")
var mlxURL: String = "http://localhost:8081"

@Option(help: "MLX model ID (must match mlx_lm.server --model)")
var mlxModel: String = "mlx-community/Qwen3-8B-4bit"

--mlx-model must match exactly the model ID that mlx_lm.server loaded at startup via --model. The gateway forwards this value to mlx_lm.server for both the Anthropic and OpenAI endpoints and advertises it under /v1/models. Whatever clients send as model in their request is ignored by the gateway: it always uses the value configured via --mlx-model for the backend call.

Starting and Testing

Two terminals:

# Terminal 1: MLX backend
mlx_lm.server --model mlx-community/Qwen3-8B-4bit --port 8081

# Terminal 2: Gateway
swift run gateway \
  --mlx-url http://localhost:8081 \
  --mlx-model mlx-community/Qwen3-8B-4bit

Anthropic endpoint:

curl -s -X POST localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-4bit",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Write a haiku about Swift."}]
  }' | jq '.content[0].text'

The model responds with a real haiku. Token counts in the response are the actual values from the MLX tokeniser.

Important for reasoning models like Qwen3 or DeepSeek-R1: max_tokens needs generous headroom. The reasoning trace (thinking before answering) consumes several hundred tokens on its own. With 256 tokens, there is often nothing left for the actual answer. 1024 is a sensible lower bound; 2048 or more for complex requests.

Qwen3 additionally supports a soft switch that skips reasoning: append the marker /no_think to the user message. The model is trained to recognise it; it emits an empty thinking block and jumps straight to the answer. Useful for short queries where reasoning is not needed:

{"role": "user", "content": "Write a haiku about Swift. /no_think"}

OpenAI endpoint:

curl -s -X POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-8B-4bit",
    "messages": [{"role": "user", "content": "Explain Structured Concurrency in two sentences."}]
  }' | jq '.choices[0].message.content'

Claude Code against the local backend:

export ANTHROPIC_BASE_URL=http://localhost:8080
claude

Unlike Article 2, real responses come back now. The model understands context, follows instructions, and produces meaningful output.

Behaviour Without a Running Backend

If mlx_lm.server is not running when a request arrives:

curl -s -i -X POST localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"x","max_tokens":256,"messages":[{"role":"user","content":"test"}]}'

HTTP/1.1 503 Service Unavailable
{"error":{"message":"MLX backend unavailable: Could not connect to the server."}}

The gateway does not crash. It logs the error, responds with 503, and is immediately ready for the next request. When mlx_lm.server starts afterwards, the next request works without restarting the gateway.

Commit and Tag

git add .
git commit -m "article-03: MLXClient — real inference via mlx_lm.server"
git tag article-03
git push origin main --tags

Six Files, Real Inference

File                      Change
App.swift                 Added --mlx-url, --mlx-model
Application+build.swift   Creates MLXClient, passes to router
Router+build.swift        Mock functions removed, conversion + error mapping
MLX/MLXClient.swift       New: actor-based HTTP client
Models/Anthropic.swift    Unchanged
Models/OpenAI.swift       Unchanged

The Codable types from Article 2 are the stable contract. They were not touched once for Article 3.

Article 4 brings streaming. The response appears token by token instead of as a block, as Claude Code and other tools expect.
