Benchmarks — Hummingbird against FastAPI and Fastify

Six articles in, we have built the gateway without worrying about numbers. In the last article we deployed it; now we look at what it delivers. We measure the Hummingbird gateway from Article 6 against two functionally equivalent implementations: one in FastAPI with uvicorn, one in Fastify 5. Same endpoints, same auth logic, same proxy to the backend, same middleware stack. The only difference is the framework underneath.

What we measure and what we do not

An LLM gateway operates in two very different modes. First, a light path: /healthz, /v1/models, auth errors — short, synchronous responses without a backend call. Second, the heavy path: /v1/messages with a real backend, where inference time dominates everything. We measure both, and they tell different stories.

What we do not measure: end-to-end with a real model. That would be a model benchmark, not a framework benchmark. Instead we use a deterministic stub that scales backend latency in a controlled way — more on that below.

Also not measured: streaming throughput. SSE is event-driven; RPS numbers for a streaming endpoint are misleading. The qualitative streaming behavior gets its own section.

Hardware and tools: M4 Air, 24 GB Unified Memory. Passive cooling means the device can enter thermal throttle territory under sustained load. We therefore use 30-second bursts rather than sustained runs. Load generator is oha 1.14, 50 concurrent connections. All numbers: measured 2026-05-20.

The stub backend

Real inference is unsuitable for a framework comparison — too much variance, too hard to control. We replace mlx_lm.server with a Python stub that mimics the OpenAI API and scales latency deterministically: ten milliseconds per hundred estimated input tokens (chars / 4). A request with a 200-character message (~50 tokens) sleeps for 5 ms before responding. With 2000 characters (~500 tokens) that is 50 ms; with 8000 characters (~2000 tokens) it is 200 ms.

This gives us three clean measurement regimes:

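| Payload size | Estimated tokens | Stub delay | Character               |
|--------------|------------------|------------|-------------------------|
| small        | ~50              | ~5 ms      | Framework still visible |
| medium       | ~500             | ~50 ms     | Mixed                   |
| large        | ~2000            | ~200 ms    | Backend dominates       |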
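
The delay rule is simple enough to write down. The following is a minimal sketch, not the stub's actual source: the function names are hypothetical, and rounding the token estimate down with integer division is an assumption (the text only says chars / 4).

```python
def estimated_tokens(text: str) -> int:
    """Token estimate as described above: one token per four characters.

    Integer division is an assumption; the article only states chars / 4.
    """
    return len(text) // 4


def stub_delay_ms(text: str) -> float:
    """Ten milliseconds per hundred estimated input tokens."""
    return estimated_tokens(text) * 10 / 100


# The three payload sizes used in the benchmark regimes.
for size in (200, 2000, 8000):
    payload = "x" * size
    print(size, estimated_tokens(payload), stub_delay_ms(payload))
```

A stub built this way would await a sleep of that length before sending its canned response, which keeps backend latency a pure function of payload size.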

The first run was unfair — and why

Anyone measuring the same gateway in three languages needs to watch for comparability. The Article 6 gateway runs every request through four middlewares: LogRequestsMiddleware, MetricsMiddleware, TracingMiddleware, and GatewayErrorMiddleware. A naive comparison implementation in FastAPI or Fastify would lack these layers and would therefore carry less per-request overhead.

A first benchmark run with this asymmetry showed Hummingbird at /healthz around 40k RPS, Fastify at 64k. That looked dramatic and was nonetheless misleading: we were comparing a production stack against a bare-bones one. To fix this, FastAPI and Fastify got the same four layers.

FastAPI receives an ASGI middleware with opentelemetry.trace.NoOpTracer, prometheus_client Counter and Histogram, and a try/except error handler. Fastify receives onRequest/onResponse hooks with @opentelemetry/api, prom-client, and a setErrorHandler. Counters and histograms carry the same names and labels as in Hummingbird (hb_requests, http_server_request_duration). The NoOp tracing costs a little and does nothing — exactly as in Hummingbird where no OTel collector is configured.
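
To make the shape of such a layer concrete, here is a minimal pure-ASGI sketch using only the standard library. It is not the benchmark's middleware: ObservabilityMiddleware, healthz, failing, and drive are hypothetical names, a Counter and a plain list stand in for the prometheus_client Counter and Histogram, and the tracing layer is omitted.

```python
import asyncio
import time
from collections import Counter


class ObservabilityMiddleware:
    """Wraps an ASGI app with a request counter, duration samples, and an error handler."""

    def __init__(self, app):
        self.app = app
        self.requests = Counter()  # stand-in for a prometheus_client Counter
        self.durations = []        # stand-in for a prometheus_client Histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # reported if the app never starts a response

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        except Exception:
            # Error-handler layer. Assumes the app failed before sending anything.
            await send_wrapper({"type": "http.response.start", "status": 500,
                                "headers": [(b"content-type", b"application/json")]})
            await send_wrapper({"type": "http.response.body",
                                "body": b'{"error":"internal"}'})
        finally:
            self.requests[(scope["method"], scope["path"], status)] += 1
            self.durations.append(time.perf_counter() - start)


async def healthz(scope, receive, send):
    """Static JSON response, comparable to the /healthz route."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b'{"status":"ok"}'})


async def failing(scope, receive, send):
    """Handler that raises, to exercise the error-handler layer."""
    raise RuntimeError("backend unavailable")


async def drive(handler, path="/healthz"):
    """Send one fake GET request through the middleware and collect what was sent."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    middleware = ObservabilityMiddleware(handler)
    await middleware({"type": "http", "method": "GET", "path": path}, receive, send)
    return middleware, sent


middleware, sent = asyncio.run(drive(healthz))
print(dict(middleware.requests), len(middleware.durations))
```

Every request pays for the wrapper call, the timer, and the counter update whether or not anything consumes the data, which is the per-request cost the bare-bones first run was missing.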

All three gateways expose /metrics with the Prometheus text format. The benchmark sends every request through auth, metrics, tracing, and the error handler in all three frameworks.

Pure framework overhead — the healthz path

GET /healthz has no backend, no database, no computation. The response is a static JSON object. What is measured here is close to pure framework routing overhead plus the middleware chain.

Measured 2026-05-20, M4 Air 24 GB, oha -z 30s -c 50:

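| Gateway                 | RPS    | P50    | P95    | P99    |
|-------------------------|--------|--------|--------|--------|
| Hummingbird (1 process) | 67 948 | 0.2 ms | 2.9 ms | 4.0 ms |
| FastAPI (4 workers)     | 66 286 | 0.2 ms | 2.7 ms | 3.8 ms |
| Fastify (1 process)     | 61 277 | 0.3 ms | 2.5 ms | 3.5 ms |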

Hummingbird edges out FastAPI, with one critical difference: FastAPI runs with four worker processes, Hummingbird with one. Per worker process, FastAPI comes to about 16.6k RPS; Hummingbird reaches 68k in its single process. That is not a small gap. Fastify also runs as a single process and lands at 61k.

The claim is not “Swift is four times faster than Python.” The claim is: SwiftNIO lets a single process exploit all CPU cores through event-loop pinning, while CPython needs multiple processes because of the GIL. This feeds directly into the next topic.
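
The per-process figures follow directly from the healthz table. A small sketch of the arithmetic, assuming only the four uvicorn workers serve requests (master and watcher are left out of the divisor); the function name is hypothetical.

```python
# (total RPS, serving processes) from the healthz table.
healthz_runs = {
    "Hummingbird": (67_948, 1),
    "FastAPI": (66_286, 4),
    "Fastify": (61_277, 1),
}


def rps_per_process(total_rps: float, processes: int) -> float:
    """Throughput each serving process contributes on average."""
    return total_rps / processes


per_process = {
    name: rps_per_process(rps, procs) for name, (rps, procs) in healthz_runs.items()
}
ratio = per_process["Hummingbird"] / per_process["FastAPI"]

for name, value in per_process.items():
    print(f"{name}: {value:,.0f} RPS per process")
print(f"Hummingbird vs FastAPI, per process: {ratio:.1f}x")
```

The ratio describes process efficiency on this one route, not language speed in general.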

With backend latency — where the framework delta disappears

Once a real backend call is in the pipeline, the picture changes fundamentally.

Anthropic/small (~5 ms backend delay), measured 2026-05-20:

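| Gateway     | RPS | P50     | P95     | P99     |
|-------------|-----|---------|---------|---------|
| Hummingbird | 883 | 56.2 ms | 65.0 ms | 69.6 ms |
| FastAPI     | 903 | 54.6 ms | 62.8 ms | 66.9 ms |
| Fastify     | 877 | 56.1 ms | 64.3 ms | 70.5 ms |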

The gaps are now within measurement noise: all three sit within 3% on RPS and 2 ms on P50. With a 5 ms backend delay, the framework overhead of under 1 ms is already nearly irrelevant.

With 50 ms and 200 ms backend delay the picture becomes even clearer:

Anthropic/medium (~50 ms), measured 2026-05-20:

| Gateway     | RPS | P50      |
|-------------|-----|----------|
| Hummingbird | 111 | 439.7 ms |
| FastAPI     | 114 | 427.3 ms |
| Fastify     | 114 | 428.2 ms |

Anthropic/large (~200 ms), measured 2026-05-20:

| Gateway     | RPS  | P50       |
|-------------|------|-----------|
| Hummingbird | 30.9 | 1633.0 ms |
| FastAPI     | 30.9 | 1655.8 ms |
| Fastify     | 30.9 | 1652.4 ms |

At 200 ms backend latency, all three gateways are on a level playing field. The P50 sits at roughly 1630-1660 ms, which is almost entirely the 50 parallel connections queued in front of the stub backend, not framework overhead. With a real model that takes 10-50 ms per token, the effect would be even more pronounced.
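
The queueing explanation can be checked against the numbers with Little's law: in a closed loop with a fixed number of connections, average latency is roughly concurrency divided by throughput. The sketch below applies that to the Hummingbird rows. It is a plausibility check under the assumption that the median is close to the mean here, not part of the benchmark suite, and the function name is hypothetical.

```python
CONCURRENCY = 50  # oha -c 50

# (RPS, measured P50 in ms) from the Hummingbird rows of the three backend tables.
hummingbird = {
    "small": (883, 56.2),
    "medium": (111, 439.7),
    "large": (30.9, 1633.0),
}


def expected_latency_ms(concurrency: int, rps: float) -> float:
    """Little's law for a closed loop: latency = in-flight requests / throughput."""
    return concurrency / rps * 1000


for regime, (rps, p50_ms) in hummingbird.items():
    predicted = expected_latency_ms(CONCURRENCY, rps)
    print(f"{regime}: predicted {predicted:.0f} ms, measured P50 {p50_ms:.0f} ms")
```

In all three regimes the prediction lands within a few percent of the measured P50, which is what one would expect if latency is set by requests waiting on the stub rather than by work inside the gateway.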

The punchline: for an LLM gateway, the choice of framework is irrelevant for end-to-end latency and throughput. The model dominates.

Memory and startup — where architectural decisions show

The picture changes when we measure operating costs rather than RPS: how much memory does the process consume at idle, and how long does a cold start take?

Measured 2026-05-20, M4 Air, after warmup (all workers loaded):

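| Gateway     | Startup | Idle RSS total | Processes                         |
|-------------|---------|----------------|-----------------------------------|
| Hummingbird | 44 ms   | 16 MiB         | 1                                 |
| FastAPI     | 472 ms  | 349 MiB        | 6 (master + watcher + 4 workers)  |
| Fastify     | 211 ms  | 75 MiB         | 1                                 |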

The FastAPI figure of 349 MiB is the sum of all six processes. A single uvicorn worker brings roughly 75 MiB; four of those plus master and watcher add up to 349 MiB. Fastify comes to 75 MiB as a single Node.js process. Hummingbird runs as a single process at 16 MiB.

This is a structural property, not a tuning problem. Python imports the entire FastAPI/Pydantic/httpx/prometheus_client ecosystem at startup. Node.js loads V8 and the npm dependencies. SwiftNIO ships as a statically linked framework inside the binary; there are no runtime imports.

For a deployment on a single home server this makes no difference. For running many gateway instances on a node, for edge deployments with limited RAM, or for function-as-a-service environments, it is a real factor.
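
A rough way to see the effect on instance density is to divide a memory budget by the idle RSS from the table. The 1 GiB budget and the function name are assumptions chosen for illustration, and idle RSS understates what a gateway needs under load, so the results are upper bounds.

```python
# Idle RSS per gateway deployment in MiB, from the table above.
idle_rss_mib = {"Hummingbird": 16, "FastAPI": 349, "Fastify": 75}


def instances_per_budget(budget_mib: int, rss_mib: int) -> int:
    """How many idle gateway deployments fit into a memory budget."""
    return budget_mib // rss_mib


BUDGET_MIB = 1024  # hypothetical 1 GiB reserved for gateways on one node
density = {
    name: instances_per_budget(BUDGET_MIB, rss) for name, rss in idle_rss_mib.items()
}
print(density)

# Share of the FastAPI total left for master and watcher,
# assuming roughly 75 MiB for each of the four workers.
master_and_watcher_mib = idle_rss_mib["FastAPI"] - 4 * 75
print(master_and_watcher_mib)
```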

Deployment artifacts

One final size comparison — what actually lands on the target host?

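| Gateway     | Artifact                       | Size                          |
|-------------|--------------------------------|-------------------------------|
| Hummingbird | Static Linux SDK binary        | 18 MB                         |
| FastAPI     | Source code                    | 28 KB + Python runtime + venv |
| Fastify     | dist/server.js + node_modules  | 8 KB + 54 MB                  |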

The Hummingbird binary from the Swift Static Linux SDK is completely self-contained — no Python, no Node.js, no runtime dependencies. For Fastify the opposite is true: the compiled code is tiny, but the node_modules tree grows with every dependency. This feeds directly into container image sizes.

Streaming — qualitative, no numbers

RPS metrics for SSE endpoints are misleading: streaming responses stay open for seconds to minutes, making throughput measurements in the classical sense meaningless. Three qualitative points instead, directly readable in the code:

Cancellation on client disconnect. When the client drops the connection, the server must abort the backend request. In Hummingbird this propagates through AsyncStream.onTermination → Task.cancel() → URLSession is aborted, without extra code, because Swift Structured Concurrency threads this through automatically. In Fastify you need to explicitly wire reply.raw.on('close', ...). In FastAPI, Request.is_disconnected() is a polling API that you have to call inside the generator function.
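
The polling pattern on the Python side can be sketched with plain asyncio. This is not gateway code: backend_tokens, relay, and demo are hypothetical names, and is_disconnected is a stand-in callable that plays the role of Request.is_disconnected.

```python
import asyncio


async def backend_tokens(log):
    """Hypothetical backend stream: one SSE chunk per millisecond, until closed."""
    n = 0
    try:
        while True:
            await asyncio.sleep(0.001)
            yield f"data: token-{n}\n\n"
            n += 1
    finally:
        log.append("backend closed")


async def relay(is_disconnected, source):
    """Forward chunks, polling for a client disconnect before each one."""
    try:
        async for chunk in source:
            if await is_disconnected():
                break  # client is gone: stop forwarding
            yield chunk
    finally:
        await source.aclose()  # abort the backend stream explicitly


async def demo(disconnect_after):
    """Consume the relay and simulate a disconnect after N received chunks."""
    log = []
    received = 0

    async def is_disconnected():
        return received >= disconnect_after

    async for _chunk in relay(is_disconnected, backend_tokens(log)):
        received += 1
    return received, log


print(asyncio.run(demo(3)))
```

The point of the sketch is the explicit check inside the loop: if the generator never polls, the backend stream keeps running after the client has left.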

Backpressure. What happens when the client reads more slowly than the backend delivers? SwiftNIO has backpressure baked in — the write buffer stalls when the client stops reading, and the backend task is throttled. Node.js streams have highWaterMark that needs to be configured. FastAPI’s StreamingResponse has no built-in backpressure mechanism; with a slow client, the buffer grows in the worker’s memory.
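
The difference between a bounded and an unbounded buffer can be illustrated with an asyncio.Queue between a fast producer and a slow consumer. This is a toy model under stated assumptions, not a measurement of any of the three frameworks; run_stream and its parameters are hypothetical.

```python
import asyncio


async def run_stream(maxsize, items=10):
    """Return the peak number of buffered chunks for a given queue bound.

    maxsize=0 means unbounded, which models a buffer without backpressure.
    """
    queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer():  # the backend, delivering as fast as it can
        nonlocal peak
        for i in range(items):
            await queue.put(i)  # suspends here while a bounded queue is full
            peak = max(peak, queue.qsize())
        await queue.put(None)  # end-of-stream marker

    async def consumer():  # the slow client
        while True:
            item = await queue.get()
            if item is None:
                return
            await asyncio.sleep(0.001)

    await asyncio.gather(producer(), consumer())
    return peak


print("bounded:", asyncio.run(run_stream(maxsize=2)))
print("unbounded:", asyncio.run(run_stream(maxsize=0)))
```

With a bound, the producer is paused as soon as the buffer is full, so memory use stays flat; without one, everything the backend produces sits in memory until the client catches up.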

Concurrent streams. All three handle high SSE concurrency. Hummingbird and Fastify scale over the event loop; FastAPI streaming runs through async generators in individual worker processes, which makes multiple workers particularly important when many streams run in parallel.

These are observations from the code — not measured, but visible in the design.

When each choice makes sense

Hummingbird / Swift pays off structurally when:

  • Memory footprint on the target host is a factor (edge, many instances)
  • Cold-start latency matters (FaaS, Kubernetes scale-from-zero)
  • Deployment artifact size counts (air-gapped environments, container registry traffic)
  • The team writes Swift or has iOS experience that transfers to server code

FastAPI or Fastify make sense when:

  • The team works in Python or TypeScript
  • The ecosystem matters (tooling, libraries, community)
  • Gateway-layer throughput and latency are not a differentiator — and for an LLM gateway they almost never are

What is not an argument: “FastAPI is slow” or “Hummingbird is three times faster.” Per process, Hummingbird is substantially more efficient. Under production load with a real model, the client sees none of that difference. That is not a contradiction; they are two different metrics.

Reproducing the results

All benchmark code lives in bench/ in the gateway repository. make setup installs Python dependencies and npm packages; make all runs all three gateways through all four scenarios and writes results/summary.txt. The numbers in this article come from the run with oha 1.14 on the M4 Air described above. On a different system — more cores, active cooling, or different background load — the absolute numbers will differ. The relative proportions should remain stable.