Subquadratic SubQ: A Startup Claims to Have Broken the Transformer Scaling Law

On May 5, 2026, a Miami startup called Subquadratic emerged from stealth with $29 million in seed funding and announced an LLM that, on paper, sounds impressive enough to make you stop short: SubQ, billed as the first production-ready model with a fully sub-quadratic attention architecture. The promises: 12 million tokens of context, a 52× speedup over FlashAttention at one million tokens, and roughly one three-hundredth of the cost of frontier models like Claude Opus.

What’s the technical problem?

The actual problem Subquadratic is going after is as old as it is unglamorous: standard transformer attention compares every token with every other token. Double the input length and the compute quadruples. This quadratic scaling is the wall long-context applications have been standing in front of for years: going from 1 million tokens to 2 million quadruples the attention cost rather than doubling it.
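
A quick back-of-the-envelope illustration of that wall, counting only the score matrix and using a head dimension of 128 as an assumed example value:

```python
# Rough attention-FLOP arithmetic: the QK^T score matrix alone costs about
# n^2 * d multiply-accumulates per head, so doubling n quadruples the work.
def attention_score_flops(n_tokens: int, head_dim: int = 128) -> int:
    return n_tokens * n_tokens * head_dim  # n * n dot products of length d

for n in (1_000_000, 2_000_000):
    print(f"{n:>9,} tokens -> {attention_score_flops(n):.3e} FLOPs per head")
# The 2M-token run costs 4x the 1M-token run, not 2x.
```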

Subquadratic is countering this with an architecture called SSA (Subquadratic Selective Attention). The idea isn't new, but the execution claims to be: instead of comparing every token pair, the model is supposed to decide, based on content, which positions are relevant to a given query in the first place, and then compute exact attention only over those. The company's blog post states the rationale explicitly:

SSA does not approximate attention. It restricts attention to the positions that actually carry signal, and skips the rest.

That’s supposed to be the trick that distinguishes SSA from earlier attempts. The field of prior attempts is well charted:

  • Fixed-Pattern Sparse Attention (Sliding Window, Strided, Dilated) shrinks the search space, but the decision is position-based: the model decides where it looks before knowing what it is looking for. If the relevant information lies outside the pattern, it simply isn’t seen (a minimal sketch of this failure mode follows after this list). Practical consequence: multi-hop retrieval and scattered evidence fall through the grid as soon as the pattern doesn’t happen to hit the answer’s position; accuracy doesn’t degrade gradually, it collapses.
  • State Space Models like Mamba replace the pairwise comparisons with a compressed state that evolves across the sequence. The scaling is linear, but the capacity is fixed: information blurs or gets lost as the sequence grows. Practical consequence: on tasks requiring exact recall of earlier tokens (code refactoring across a large file, legal cross-references, long dialogues with backreferences), accuracy measurably drops with growing sequence length.
  • Hybrid architectures combine efficient and dense layers. In practice the dense layers continue to carry the load, which means quadratic scaling is only deferred, not removed. Practical consequence: the promised scaling profile holds for synthetic benchmarks and short demo workloads; for real long-context applications the cost curve remains quadratic-dominated.
  • DeepSeek Sparse Attention offloads selection to a “lightning indexer” — which itself, according to Subquadratic’s analysis, scales quadratically. Complexity moved, not removed. Practical consequence: at very long context the indexer itself becomes the bottleneck, and the O(n²) wall returns through the back door — just one layer earlier.
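
A minimal sketch of the fixed-pattern failure mode mentioned in the first bullet, using a sliding-window mask over an assumed toy sequence; window size and positions are placeholder values:

```python
import numpy as np

# Sliding-window mask: the attention pattern is fixed before the model has
# seen any content, so a relevant token outside the window is unreachable.
def sliding_window_mask(n: int, w: int) -> np.ndarray:
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w  # True = may attend

mask = sliding_window_mask(n=16, w=2)
query_pos, evidence_pos = 12, 1            # evidence sits far from the query
print(mask[query_pos, evidence_pos])       # False: masked out by construction
```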

SSA, by its own description, attempts to solve the actual open problem: a mechanism that is simultaneously efficient, content-dependent, and capable of retrieving from arbitrary positions across long contexts. Whether that is true is what we test against the numbers in a moment.
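
Subquadratic has not published the SSA mechanism itself, so the following is only a generic sketch of what content-dependent selection means in principle, not the company's method. The scoring and top-k step are our placeholder choices, and a naive top-k over full scores is itself not sub-quadratic, which is exactly the open problem:

```python
import numpy as np

# Generic content-dependent selection (NOT Subquadratic's SSA): each query
# scores the keys, keeps only the top-k positions, and runs exact softmax
# attention over that subset. The hard part SSA claims to solve is making
# the selection itself cheap, without a quadratic indexing pass.
def selective_attention(q, K, V, k=8):
    scores = K @ q                               # content-dependent relevance
    keep = np.argpartition(scores, -k)[-k:]      # positions that "carry signal"
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ V[keep], np.sort(keep)            # exact attention over the subset

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
out, kept = selective_attention(q, K, V)
print(kept)                                      # whichever positions scored highest
```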

The numbers on the table

Subquadratic puts forward a series of benchmarks. We cleanly separate self-measurements from third-party verified values, because that distinction is what matters most when weighing the numbers.

Speed (self-measurement on B200 GPUs)

Context length    SSA speedup vs. FlashAttention
128K              7.2×
256K              13.2×
512K              23.0×
1M                52.2×
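
One thing worth noting about these self-reported figures: if SSA's cost grew roughly linearly while the dense baseline stays quadratic, the speedup should roughly double with every doubling of context, and that is the pattern the table shows. A simple consistency check, arithmetic only, no independent measurement:

```python
# Self-reported speedups from the table above. The ratio between successive
# rows should sit near 2.0 if SSA scales ~linearly against a quadratic baseline.
speedups = {"128K": 7.2, "256K": 13.2, "512K": 23.0, "1M": 52.2}
pairs = list(speedups.items())
for (a, sa), (b, sb) in zip(pairs, pairs[1:]):
    print(f"{a} -> {b}: speedup grows {sb / sa:.2f}x")
# Ratios of roughly 1.7-2.3x per doubling are consistent with the sub-quadratic
# claim, but say nothing about whether the measurements themselves are sound.
```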

Compute reduction (self-measurement)

At 1 million tokens the company reports a 62.5× reduction in attention FLOPs compared to standard quadratic attention. At 12 million tokens the factor is supposed to head toward 1,000×.
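
A back-of-the-envelope reading of these factors, under the assumption (ours, not Subquadratic's) that the FLOP reduction simply equals n divided by the average number of positions k each query actually attends to:

```python
# IF reduction = n / k, the self-reported factors imply the following effective
# number of attended positions per query. Subquadratic does not publish k;
# the assumption is ours.
for n, reduction in ((1_000_000, 62.5), (12_000_000, 1_000)):
    k = n / reduction
    print(f"n = {n:>10,}: {reduction}x reduction -> k ≈ {k:,.0f} positions per query")
# 1M tokens at 62.5x implies ~16,000 positions; 12M at ~1,000x implies ~12,000.
```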

Retrieval benchmarks

Here the question of sources becomes even more decisive, so each benchmark is marked with where its numbers come from.

RULER @ 128K (self-measurement by Subquadratic) — a benchmark for multi-hop retrieval, aggregation, and variable tracking. Subquadratic reports 95.0% versus 94.8% for Claude Opus 4.6. Practically tied, with allegedly drastically lower costs.

MRCR v2 measures how well a model finds and integrates several scattered pieces of evidence across long contexts. The table gets interesting here because self-measurement and external verification mix: the SubQ value is the only number in the entire announcement that is explicitly flagged as third-party verified, while all other model values are given by Subquadratic in its own comparison, typically taken from the respective vendor reports without re-running the measurement independently:

Model             MRCR v2    Source
Opus 4.6          78.3 %     reported by Subquadratic
GPT-5.5           74.0 %     reported by Subquadratic
SubQ              65.9 %     third-party verified
GPT-5.4           36.6 %     reported by Subquadratic
Opus 4.7          32.2 %     reported by Subquadratic
Gemini 3.1 Pro    26.3 %     reported by Subquadratic

SubQ sits clearly ahead of most frontier models, but below Opus 4.6. Notable: Opus 4.7 and Gemini 3.1 Pro fall surprisingly short on this test. The benchmark apparently measures something current dense models’ routing strategies don’t capture — which can speak both in favor of SubQ and call the validity of the benchmark itself into question.

SWE-Bench Verified (self-measurement): SubQ lands at 81.8%, narrowly ahead of Opus 4.6 (80.8%), but well behind Opus 4.7 (87.6%). Coding, then, is not the area where SubQ pulls away from the competition.

Methodological caveats

Three things stand out that shouldn’t be brushed aside.

First, the usual self-measurement problem. We have a young company, a model not yet publicly available — early access only by application — and benchmark numbers that come almost exclusively from the company itself. Subquadratic explicitly writes in the technical post that a full model card is still to come. The only number flagged as third-party verified is the MRCR-v2 score of 65.9%. That’s decent, but it’s not a “1,000× efficiency leap.” VentureBeat picked up exactly this in its report: researchers demand independent proof and disagree on whether the breakthrough is real.

Second, the historical learning curve. We’ve had Mamba, RWKV, Linear Attention, State Space Models, Performer, Reformer, Longformer, BigBird: the list of “transformer killers” that arrived with impressive benchmarks and ended up vanishing into niches is long. The Subquadratic team itself points out the weaknesses of all these predecessors in its blog post. The company answers the question of why this architecture doesn’t inherit their limitations with a reference to “content-dependent routing without quadratic indexer.” That’s plausible, but the scientific community hasn’t reviewed it yet.

Third, the state of the discussion on Hacker News. The first substantial comment on the announcement wonders why the matter isn’t getting more attention, and closes with the candid “I suppose it’s just an announcement and we can’t test it ourselves yet”. That’s exactly the point: as long as we can’t get our hands on the model, every number is marketing.

There’s also a methodological detail. In the technical post Subquadratic writes verbatim, “FlashAttention-3 did not produce a speedup on B200s over FlashAttention-2”, and therefore picks FA-2 as the baseline. That may be true on the tested hardware, but it’s a convenient choice: FA-3 is, according to the FlashAttention repository, “optimized for Hopper GPUs (e.g. H100)”, not for Blackwell. By now there is FlashAttention-4, which the same repo describes as “written in CuTeDSL and optimized for Hopper and Blackwell GPUs (e.g. H100, B200)”. So a baseline does exist that is designed for exactly the tested hardware, and Subquadratic still benchmarks against FA-2. A run against FA-4 on B200 would show more precisely how much of the 7.2× / 52× speedups actually comes from SSA and how much from FA-2 simply not being optimized for Blackwell.
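
For anyone who wants to rerun that comparison once kernels are in hand, a minimal timing harness might look like the sketch below. It uses PyTorch's built-in scaled_dot_product_attention purely as a stand-in dense baseline; the real exercise would swap in the FlashAttention-2, FlashAttention-4, and SSA kernels on the same B200, and the sequence lengths here are placeholder values:

```python
import time
import torch
import torch.nn.functional as F

# Minimal latency harness (sketch): time a dense attention baseline at a given
# context length. Swap the call inside attn_ms for FA-2 / FA-4 / SSA kernels
# to reproduce the baseline comparison discussed above.
def attn_ms(seq_len: int, heads: int = 8, dim: int = 128, iters: int = 10) -> float:
    q = torch.randn(1, heads, seq_len, dim, device="cuda", dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

if torch.cuda.is_available():
    for n in (8_192, 16_384, 32_768):
        print(f"{n:>6} tokens: {attn_ms(n):.1f} ms")  # expect roughly 4x per doubling
```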

Practical implications, if the architecture holds

Let’s run through the optimistic case as well, because that too is part of a sober assessment.

If SSA delivers what it promises, what changes isn’t the model — it’s how we build applications. The last two years of LLM engineering consist of one big workaround discipline: RAG pipelines, chunk strategies, hybrid search, reranking, agent decomposition, subagent orchestration, context compaction, memory systems. We’ve put in gigantic effort to get around the O(n²) attention limit.

A usable 12-million-token context simply puts some of these disciplines into question. A complete codebase in one pass? An entire compliance document plus all referenced annexes simultaneously in view? A long advisory case with all side files in a single prompt? That is exactly the pitch, and that is exactly where the practical value lies. The Subquadratic team puts it this way:

The failure mode of short-context systems is not merely that they are missing some context. It is that they are forced to reason about fragments.

That’s true. But, and here comes the sober part, nominal context is not functional context. Subquadratic itself points to this distinction: a context window says nothing about how reliably the model can reason across it. That’s exactly the gap the MRCR-v2 numbers try to close, and that’s exactly where SubQ still trails Opus 4.6.

Practical positioning

A thought experiment helps here. Imagine someone walks into the office tomorrow and says, “Let’s throw out RAG for our SAP knowledge base — we have 12 million tokens of context now.” What’s our straight answer?

It is: Maybe later, but not now. For three reasons. First, availability: anyone who can’t get their hands on the model isn’t building with it. Second, maturity: even if the architecture works, the tooling ecosystems, operations experience, and failure-mode catalogs are missing, the kind we’ve built up around Claude, GPT, and Gemini over the past two years. Third, functional context: as long as MRCR v2 sits below Opus 4.6’s score, 12 million tokens is more promise than usable reality.

What we do is keep SubQ on the watchlist. If independent replications come in — from the academic community, from HuggingFace reproductions, from serious production deployments — the picture changes. Until then we file the announcement where it currently belongs: interesting architecture, plausible thesis, impressive self-measurements, and an exclamation mark drawn in pencil.

Why this drumroll matters at all

Even if SubQ ends up falling short of its own promises, the discussion is productive. It forces the field to measure functional rather than nominal context. It makes clear that today’s workarounds — RAG, chunk strategies, multi-agent choreography — aren’t the final answer but symptoms of an architectural problem. And it serves as a reminder that the genuinely interesting progress in AI is measured not in parameter counts but in better inductive biases.

The picture so far: a young company claims to solve a long-known problem with a new method. Whether the method actually holds up cannot be answered from the data currently available. Independent of the outcome, however, the problem formulation remains — content-dependent routing without a quadratic indexer is the right technical frame for thinking about the next chapter of LLM architectures.

SubQ stays on the watchlist. Until independent replications are available, the published numbers are an announcement, not a reliable basis for architectural decisions.

Sources