Auth and RequestContext — Closing the Gate

The gateway from Article 4 is functional but open: every request passes through, errors arrive as unstructured JSON, and nothing stops a client from firing requests in a tight loop. For any production use, or even a simple multi-tenant test setup, we need authentication, rate limiting, and error formats that match what each protocol’s clients actually expect. Anthropic clients expect {"type":"error","error":{"type":"...","message":"..."}}, OpenAI clients expect {"error":{"message":"...","type":"...","param":null,"code":null}}. We deliver both, and we do it using Hummingbird 2’s generic RequestContext, the feature we mentioned in the prologue and have not needed until now.

Generic RequestContext

Hummingbird 2 makes Router and Application generic over the request context. So far we have been using BasicRequestContext, which carries nothing beyond Hummingbird’s internal infrastructure. In this article we define GatewayRequestContext and attach two properties to it: the authenticated API key and the associated tenant name.

import Hummingbird

struct GatewayRequestContext: RequestContext {
    var coreContext: CoreRequestContextStorage

    var apiKey: String?
    var tenant: String?

    init(source: ApplicationRequestContextSource) {
        self.coreContext = .init(source: source)
        self.apiKey = nil
        self.tenant = nil
    }
}

CoreRequestContextStorage is Hummingbird’s default store for the logger, decoder, and framework-internal data; we delegate to it rather than replace it so the framework can continue to enforce its own invariants. Our own properties are plain optionals, explicitly set to nil in the initializer.

Middleware carries values into the context by passing a modified copy to next:

var context = context
context.tenant = tenant
return try await next(request, context)

Because GatewayRequestContext is a struct, this creates a new copy: no shared state, no locks, no reasoning about concurrent access required. Vapor solves the same need with request.storage[MyKey.self] = value, a keyed storage container in which every lookup returns an optional, so whether a given middleware actually stored its value is only known at runtime. Hummingbird’s approach declares what middleware contributes on the context type itself, so a mismatch between the context a middleware expects and the one the router provides surfaces at compile time rather than at runtime.
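The copy semantics can be tried out without the framework. The following is a minimal sketch, assuming a stand-in type DemoContext and a hypothetical helper attachTenant; neither is part of the gateway or of Hummingbird.

```swift
// Stand-in for GatewayRequestContext without the Hummingbird dependency (assumption).
struct DemoContext {
    var apiKey: String? = nil
    var tenant: String? = nil
}

// Hypothetical helper mirroring what a middleware does before calling `next`.
func attachTenant(_ tenant: String, to context: DemoContext) -> DemoContext {
    var copy = context      // value copy; the caller's context is not affected
    copy.tenant = tenant
    return copy
}

let incoming = DemoContext()
let enriched = attachTenant("alice", to: incoming)
print(incoming.tenant ?? "nil")   // the original stays unset
print(enriched.tenant ?? "nil")   // only the copy carries the tenant
```

Only the copy handed downstream sees the tenant; anything still holding the original context observes no change.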

API Key Authentication

We accept keys via two header conventions: x-api-key (Anthropic style) and Authorization: Bearer (OpenAI style). Either one is enough; the first match wins.

static func extractKey(from request: Request) -> String? {
    if let name = HTTPField.Name("x-api-key"),
       let value = request.headers[name], !value.isEmpty {
        return value
    }
    if let auth = request.headers[.authorization],
       auth.hasPrefix("Bearer ") {
        let key = String(auth.dropFirst("Bearer ".count))
            .trimmingCharacters(in: .whitespaces)
        return key.isEmpty ? nil : key
    }
    return nil
}

The "Bearer " prefix check with a trailing space quietly rejects strings like Bearer-x — no explicit format error needed, because a malformed Authorization header simply yields no key and the request is treated as unauthenticated.
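To see which header values survive the prefix check, the Bearer branch can be isolated into a small function that works on a raw header string. This is a sketch; bearerToken(from:) is a hypothetical name and not part of the middleware shown above.

```swift
import Foundation

// Hypothetical helper: the Bearer branch of extractKey, applied to a raw header value.
func bearerToken(from authorization: String) -> String? {
    // The trailing space in the prefix is what rejects values such as "Bearer-x".
    guard authorization.hasPrefix("Bearer ") else { return nil }
    let key = String(authorization.dropFirst("Bearer ".count))
        .trimmingCharacters(in: .whitespaces)
    // A header that consists of the scheme only yields no key.
    return key.isEmpty ? nil : key
}

print(bearerToken(from: "Bearer sk-test-alice") ?? "nil")
print(bearerToken(from: "Bearer-x") ?? "nil")
print(bearerToken(from: "Bearer    ") ?? "nil")
```

Note that hasPrefix is case-sensitive, so a lowercase "bearer " scheme is also treated as no key in this sketch.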

Constant-Time Comparison

A naive string equality check (==) short-circuits as soon as it finds a differing byte. Over a sufficient number of requests, an attacker can infer from latency differences how many bytes of their candidate match the correct key. This is not a theoretical concern — timing attacks against remote endpoints have been demonstrated in practice.

We therefore do not compare strings directly, but compare their SHA-256 hashes byte by byte using XOR accumulation:

static func constantTimeEquals(_ a: Data, _ b: Data) -> Bool {
    guard a.count == b.count else { return false }
    var diff: UInt8 = 0
    for i in 0..<a.count {
        diff |= a[i] ^ b[i]
    }
    return diff == 0
}

Both inputs are SHA-256 hashes (always 32 bytes), so the length guard is a safety invariant, not a timing-sensitive branch on user input: an attacker cannot learn anything from a.count != b.count because both inputs are always the same length. The XOR accumulation always runs 32 iterations regardless of where the keys differ.

Configured keys are hashed once at middleware initialization and stored in a [Data: String] dictionary, so at runtime we only need to hash the incoming key. Rather than a direct dictionary lookup, we iterate over all configured hashes — this keeps timing proportional to the size of the key list rather than to which particular key matched, closing one more potential timing channel.
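The iteration over all configured hashes might look roughly like the sketch below. The function name lookupTenant is an assumption, and the hashes are taken as ready-made 32-byte Data values so that the example needs only Foundation; in the gateway they would come from SHA256.

```swift
import Foundation

// Same idea as the comparison above: XOR-accumulate over equal-length inputs.
func constantTimeEquals(_ a: Data, _ b: Data) -> Bool {
    guard a.count == b.count else { return false }
    var diff: UInt8 = 0
    for (x, y) in zip(a, b) {
        diff |= x ^ y
    }
    return diff == 0
}

// Hypothetical lookup: visits every configured hash and never returns early,
// so the number of comparisons does not depend on which entry matched.
func lookupTenant(for candidateHash: Data, in table: [Data: String]) -> String? {
    var match: String? = nil
    for (hash, tenant) in table {
        if constantTimeEquals(hash, candidateHash) {
            match = tenant
        }
    }
    return match
}
```

The loop deliberately keeps going after a hit. The assignment inside the if is still a small data-dependent branch, which this sketch accepts for readability.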

SHA256 comes from swift-crypto. On macOS it wraps CryptoKit, on Linux it wraps BoringSSL, but the API is identical in both environments — our code needs no #if canImport(CryptoKit) conditional and compiles unmodified on both platforms.

Tenant Configuration

Keys are configured as "tenant:key" pairs, for example "alice:sk-abc123,bob:sk-def456". The middleware parses these pairs at initialization and stores only the hashes — the plaintext keys never leave the initialization block. Pairs without a colon or with an empty tenant name are silently ignored. After successful authentication, context.tenant holds the tenant name ready for the rate limiter that follows in the middleware chain.
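The parsing rules can be sketched as follows. parseKeyPairs is a hypothetical name; for illustration it returns the plaintext pairs, whereas the middleware described here would hash each key immediately. Splitting on the first colon only and skipping empty keys are assumptions that the text does not spell out.

```swift
// Hypothetical parser for "tenant:key" entries.
func parseKeyPairs(_ pairs: [String]) -> [(tenant: String, key: String)] {
    var result: [(tenant: String, key: String)] = []
    for pair in pairs {
        // Entries without a colon are ignored.
        guard let colon = pair.firstIndex(of: ":") else { continue }
        // Split on the first colon only, so a key may itself contain colons (assumption).
        let tenant = String(pair[..<colon])
        let key = String(pair[pair.index(after: colon)...])
        // Empty tenant names are ignored per the text; skipping empty keys is an assumption.
        guard !tenant.isEmpty, !key.isEmpty else { continue }
        result.append((tenant: tenant, key: key))
    }
    return result
}

let rawOption = "alice:sk-abc123,bob:sk-def456,broken,:sk-orphan"
let parsed = parseKeyPairs(rawOption.split(separator: ",").map(String.init))
for entry in parsed {
    print(entry.tenant, entry.key)
}
```

Of the four entries in rawOption, only the alice and bob pairs survive; "broken" has no colon and ":sk-orphan" has no tenant name.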

Spec-Compliant Error Responses

Hummingbird’s default error format is {"error":"..."}. Both specs differ from that. We define GatewayError as its own error enum:

enum GatewayError: Error {
    case unauthorized(String)
    case badRequest(String)
    case rateLimitExceeded(String, retryAfter: Int)
    case backendUnavailable(String)
    case backendError(String)
    case internalError(String)
}

Each case carries a readable message; the rate limit case additionally carries the retry-after duration as a labeled associated value so that the middleware can write it directly into the response header without parsing it back out of an error string. The mapping from enum case to HTTP status code and protocol-specific type string is encoded directly in the enum:

Case                Status  Anthropic type         OpenAI type
unauthorized        401     authentication_error   authentication_error
badRequest          400     invalid_request_error  invalid_request_error
rateLimitExceeded   429     rate_limit_error       rate_limit_exceeded
backendUnavailable  503     overloaded_error       server_error
backendError        502     api_error              server_error
internalError       500     api_error              server_error
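One way to encode this table in the enum is a set of computed properties. The sketch below is an assumption about the shape, not the gateway’s actual code: the property names status, anthropicType, and openAIType are invented here, and the status is a plain Int to keep the example free of framework types.

```swift
enum GatewayError: Error {
    case unauthorized(String)
    case badRequest(String)
    case rateLimitExceeded(String, retryAfter: Int)
    case backendUnavailable(String)
    case backendError(String)
    case internalError(String)

    // HTTP status code per case (plain Int for illustration).
    var status: Int {
        switch self {
        case .unauthorized: return 401
        case .badRequest: return 400
        case .rateLimitExceeded: return 429
        case .backendUnavailable: return 503
        case .backendError: return 502
        case .internalError: return 500
        }
    }

    // Type string for the Anthropic error envelope.
    var anthropicType: String {
        switch self {
        case .unauthorized: return "authentication_error"
        case .badRequest: return "invalid_request_error"
        case .rateLimitExceeded: return "rate_limit_error"
        case .backendUnavailable: return "overloaded_error"
        case .backendError, .internalError: return "api_error"
        }
    }

    // Type string for the OpenAI error envelope.
    var openAIType: String {
        switch self {
        case .unauthorized: return "authentication_error"
        case .badRequest: return "invalid_request_error"
        case .rateLimitExceeded: return "rate_limit_exceeded"
        case .backendUnavailable, .backendError, .internalError: return "server_error"
        }
    }
}
```

Because each switch is exhaustive, adding a new case later forces a decision about its status and both type strings at compile time.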

GatewayErrorMiddleware catches any GatewayError thrown downstream and determines the format from the request path:

struct GatewayErrorMiddleware: RouterMiddleware {
    typealias Context = GatewayRequestContext

    func handle(_ request: Request, context: Context,
                next: (Request, Context) async throws -> Response) async throws -> Response {
        do {
            return try await next(request, context)
        } catch let error as GatewayError {
            return Self.buildResponse(for: error, path: request.uri.path)
        }
    }
}

Requests to /v1/messages get the Anthropic format; everything else gets OpenAI. A path-prefix check is more reliable than header sniffing because clients sometimes set both header types and the protocol intent is clearer from the route than from any particular header.
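The path-based decision itself is small. A minimal sketch, assuming a hypothetical ErrorFormat enum and a helper errorFormat(forPath:) that buildResponse could call internally:

```swift
// Hypothetical type naming the two wire formats.
enum ErrorFormat {
    case anthropic
    case openAI
}

// Anthropic format for the messages route, OpenAI format for everything else.
func errorFormat(forPath path: String) -> ErrorFormat {
    return path.hasPrefix("/v1/messages") ? .anthropic : .openAI
}

print(errorFormat(forPath: "/v1/messages"))
print(errorFormat(forPath: "/v1/chat/completions"))
```

Unknown paths fall through to the OpenAI shape, which matches the "everything else" rule stated above.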

One detail specific to the OpenAI format: the Encodable conformance that Swift synthesizes uses encodeIfPresent for optional properties, so with JSONEncoder a nil value is omitted and its key simply disappears from the JSON. The OpenAI spec, however, explicitly requires "param": null and "code": null to be present. We implement encode(to:) manually and call encodeNil(forKey:) for both properties, ensuring they always appear in the output regardless of their value.
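A manual encode(to:) along these lines could look as follows. The type name OpenAIErrorBody and its nested Detail are assumptions for this sketch; it also encodes a real value when one is set and falls back to encodeNil only for nil, which is one possible reading of the description.

```swift
import Foundation

// Hypothetical model of the OpenAI error envelope.
struct OpenAIErrorBody: Encodable {
    struct Detail: Encodable {
        var message: String
        var type: String
        var param: String? = nil
        var code: String? = nil

        enum CodingKeys: String, CodingKey {
            case message, type, param, code
        }

        func encode(to encoder: Encoder) throws {
            var container = encoder.container(keyedBy: CodingKeys.self)
            try container.encode(message, forKey: .message)
            try container.encode(type, forKey: .type)
            // Write an explicit null instead of dropping the key.
            if let param = param {
                try container.encode(param, forKey: .param)
            } else {
                try container.encodeNil(forKey: .param)
            }
            if let code = code {
                try container.encode(code, forKey: .code)
            } else {
                try container.encodeNil(forKey: .code)
            }
        }
    }

    var error: Detail
}

let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]   // stable key order for the demo only
let body = OpenAIErrorBody(error: .init(message: "Missing API key.",
                                        type: "authentication_error"))
let json = String(decoding: try encoder.encode(body), as: UTF8.self)
print(json)
```

The printed JSON contains both "param":null and "code":null; with the synthesized conformance those two keys would be absent.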

Token-Bucket Rate Limiting

The token-bucket algorithm is straightforward: each tenant has a bucket with capacity burst that refills at ratePerMinute / 60 tokens per second. One request consumes one token. A request that finds the bucket empty is answered with a 429.

We implement this as a Swift actor that manages a dictionary of per-tenant bucket states:

actor RateLimiter {
    private struct Bucket {
        var tokens: Double
        var lastRefill: Date
    }

    let capacity: Double
    let refillPerSecond: Double
    private var buckets: [String: Bucket] = [:]

    init(ratePerMinute: Int, burst: Int) {
        self.capacity = Double(max(1, burst))
        self.refillPerSecond = Double(max(1, ratePerMinute)) / 60.0
    }

    func consume(tenant: String) -> Int? {
        let now = Date()
        var bucket = buckets[tenant] ?? Bucket(tokens: capacity, lastRefill: now)

        let elapsed = max(0, now.timeIntervalSince(bucket.lastRefill))
        bucket.tokens = min(capacity, bucket.tokens + elapsed * refillPerSecond)
        bucket.lastRefill = now

        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0
            buckets[tenant] = bucket
            return nil
        }

        buckets[tenant] = bucket
        let needed = 1.0 - bucket.tokens
        return max(1, Int((needed / refillPerSecond).rounded(.up)))
    }
}

Two design decisions are worth noting. Rather than running a background task that fires every N seconds, we compute the refill at consume time: elapsed * refillPerSecond gives the tokens that have accumulated since the last call. This keeps the actor free of any background lifecycle concerns — there is no Task.detached that needs to be cancelled on shutdown.

New tenants start with a full bucket. A client making its very first request has not had the chance to abuse anything yet, so starting them at zero would be needlessly punishing. They get their full burst allowance immediately; the steady-state rate only kicks in once they have burned through it.

The retry-after value is rounded up to a conservative upper bound. A client that waits the suggested number of seconds will find at least one token ready on the next attempt. Clients that retry earlier will get another 429, which is correct behavior under RFC 6585.
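The refill and retry-after arithmetic is easier to follow with a fixed clock. The sketch below is a deterministic, single-tenant variant of the same logic; DemoBucket and its consume(at:) method are assumptions made for experimentation and take the current time in seconds as a parameter instead of calling Date().

```swift
// Hypothetical single-tenant bucket with an injected clock (seconds as Double).
struct DemoBucket {
    let capacity: Double
    let refillPerSecond: Double
    var tokens: Double
    var lastRefill: Double

    init(ratePerMinute: Int, burst: Int, start: Double = 0) {
        capacity = Double(max(1, burst))
        refillPerSecond = Double(max(1, ratePerMinute)) / 60.0
        tokens = capacity          // new buckets start full
        lastRefill = start
    }

    // Returns nil when a token was consumed, otherwise the suggested wait in whole seconds.
    mutating func consume(at now: Double) -> Int? {
        // Lazy refill: credit the tokens accumulated since the last call.
        let elapsed = max(0, now - lastRefill)
        tokens = min(capacity, tokens + elapsed * refillPerSecond)
        lastRefill = now

        if tokens >= 1.0 {
            tokens -= 1.0
            return nil
        }
        // Time until one whole token exists, rounded up.
        let needed = 1.0 - tokens
        return max(1, Int((needed / refillPerSecond).rounded(.up)))
    }
}

var bucket = DemoBucket(ratePerMinute: 6, burst: 2)
let first = bucket.consume(at: 0)    // token available
let second = bucket.consume(at: 0)   // second token of the burst
let third = bucket.consume(at: 0)    // bucket empty: suggested wait in seconds
let later = bucket.consume(at: 11)   // after waiting, a token has accumulated
print(first as Any, second as Any, third as Any, later as Any)
```

With 6 requests per minute the refill rate is 0.1 tokens per second, so the third back-to-back request is told to wait 1 / 0.1 = 10 seconds, and a retry after that interval succeeds.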

Middleware Wiring

The router switches from BasicRequestContext to GatewayRequestContext:

func buildRouter(
    mlxClient: MLXClient,
    modelID: String,
    keyPairs: [String],
    limiter: RateLimiter
) -> Router<GatewayRequestContext> {
    let router = Router(context: GatewayRequestContext.self)

    router.addMiddleware {
        LogRequestsMiddleware(.info)
        GatewayErrorMiddleware()
    }

    router.get("healthz") { _, _ in HTTPResponse.Status.ok }

    let authed = router.group("v1")
    if !keyPairs.isEmpty {
        authed
            .add(middleware: APIKeyAuthMiddleware(keyPairs: keyPairs))
            .add(middleware: RateLimitMiddleware(limiter: limiter))
    }

    authed.get("models") { ... }
    authed.post("messages") { ... }
    authed.post("chat/completions") { ... }

    return router
}

The order in the middleware chain is deliberate. LogRequestsMiddleware comes first so that even failed auth attempts appear in the log — a request that ends in a 401 is precisely the one you want a record of during an incident. GatewayErrorMiddleware sits immediately behind it and wraps everything downstream, catching any GatewayError thrown by authentication or rate limiting and converting it into the right JSON shape. APIKeyAuthMiddleware then sets context.tenant, and RateLimitMiddleware follows last because it reads the tenant name from the context.

/healthz sits outside the "v1" group and runs through none of the authentication middleware. Liveness probes from Kubernetes or Docker Compose should not require an API key to function.

When keyPairs is empty, because --api-keys "" was passed or the option was omitted entirely, we skip installing the middleware pair. The gateway then behaves exactly as it did in Article 4, with no authentication at all. Backward compatibility comes from that single conditional at wiring time; the route handlers need no special-casing.

We explicitly convert body decode errors to GatewayError.badRequest:

let payload: ChatCompletionRequest
do {
    payload = try await request.decode(as: ChatCompletionRequest.self, context: context)
} catch {
    throw GatewayError.badRequest("Invalid request body: \(error.localizedDescription)")
}

Without this conversion, a Hummingbird-internal HTTPError would pass straight through GatewayErrorMiddleware — which only catches GatewayError — and reach the client as an unformatted JSON response that matches neither the Anthropic nor the OpenAI error schema.

CLI Configuration

App.swift gets three new options:

@Option(help: "Comma-separated tenant:key pairs, e.g. 'alice:sk-abc,bob:sk-def'. Empty disables auth.")
var apiKeys: String = ""

@Option(help: "Rate limit per minute per tenant")
var rateLimitPerMinute: Int = 60

@Option(help: "Burst capacity for the rate limit")
var rateLimitBurst: Int = 10

A startup command with auth enabled looks like this:

swift run gateway \
  --api-keys "alice:sk-abc123,bob:sk-def456" \
  --rate-limit-per-minute 60 \
  --rate-limit-burst 10

With 60 requests per minute and a burst of 10, a tenant can issue the first 10 requests back-to-back and then gets throttled to 1 request per second. After a 10-second idle period the bucket is full again and the next burst is available.

For Claude Code as a client, two environment variables are sufficient:

export ANTHROPIC_API_KEY=sk-abc123
export ANTHROPIC_BASE_URL=http://localhost:8080
claude

Claude Code forwards ANTHROPIC_API_KEY as the x-api-key header — exactly what APIKeyAuthMiddleware checks first.

Testing with curl

Start the gateway:

swift run gateway \
  --port 8090 \
  --api-keys "alice:sk-test-alice,bob:sk-test-bob" \
  --rate-limit-per-minute 6 \
  --rate-limit-burst 2

A rate of 6/min and burst 2 makes errors quick to reproduce.

No key, Anthropic path:

curl -s -X POST http://127.0.0.1:8090/v1/messages \
  -H "content-type: application/json" \
  -d '{"model":"x","max_tokens":10,"messages":[{"role":"user","content":"hi"}]}'
{"type":"error","error":{"type":"authentication_error","message":"Missing API key. Provide via x-api-key header or Authorization: Bearer."}}

No key, OpenAI path:

curl -s -X POST http://127.0.0.1:8090/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{"model":"x","messages":[{"role":"user","content":"hi"}]}'
{"error":{"message":"Missing API key. Provide via x-api-key header or Authorization: Bearer.","type":"authentication_error","param":null,"code":null}}

Bearer convention with a valid key:

curl -s -o /dev/null -w "%{http_code}\n" \
  http://127.0.0.1:8090/v1/models \
  -H "authorization: Bearer sk-test-alice"
200

Exceeding the rate limit:

for i in 1 2 3; do
  printf "req $i: "
  curl -s -o /dev/null -w "%{http_code}\n" \
    http://127.0.0.1:8090/v1/models \
    -H "x-api-key: sk-test-alice"
done
req 1: 200
req 2: 200
req 3: 429

The 429 response carries a Retry-After header with the suggested wait time in seconds. Alice and Bob have completely separate buckets; an exhausted alice allowance has no effect on bob’s requests.

Health check without a key:

curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8090/healthz
200

Streaming and Backend Stay Untouched

Streaming from Article 4 continues to work without any modification. Auth and rate limiting run as middleware before the route handler commits the response, so once the handler returns a Response object the HTTP status is already set and the streaming body begins to flow — middleware can no longer interrupt it. Auth errors therefore always arrive as 401 before the stream starts, never mid-stream.

MLXClient, all Anthropic and OpenAI model types, and StreamingResponses.swift remain completely untouched. The middleware chain sits in front of the backend call, so the backend — whether mlx_lm.server on macOS or Ollama on Linux — only ever sees requests that have already been authenticated and charged against the rate limit. From its perspective the gateway is still a regular HTTP client.

Next: Observability and Deployment

The gateway is now a closed service. What is still missing is visibility: without metrics you cannot tell whether the rate limit is firing, what the backend latency looks like, or whether errors are accumulating. Article 6 covers observability: Prometheus metrics, structured logging, and what to watch for when cross-compiling with the Swift Static Linux SDK for anyone who wants to deploy the gateway as a container.