OpenRouter Fusion vs Claude Fable 5: 7x Slower, 4x the Cost

OpenRouter claims Fusion beats Fable 5 at half the price. HN benchmarks say otherwise. Here's when multi-model routing earns its cost.

ComputeLeap Team

OpenRouter just launched Fusion, a multi-model routing API that fans your prompt out to multiple LLMs simultaneously, synthesizes their responses through a judge model, and returns a single answer. The pitch: frontier-level intelligence at half the price of Claude Fable 5. The Hacker News reality check: 7× slower and 4× the cost of just calling a single top model directly.

So which is it?

The timing is not a coincidence. Anthropic pulled Fable 5 barely a week ago over export control concerns, and with that freeze still reverberating, operators are scrambling for a hedge against single-vendor risk. OpenRouter is selling exactly that: don't depend on one frontier model when you can blend several. But the economics of multi-model routing are more nuanced than the marketing suggests.

Let's break down what Fusion actually does, what the benchmarks say, and — critically — when the math works in your favor versus when you're just paying more for slower answers.

How Fusion Actually Works

[Figure: OpenRouter Fusion architecture diagram showing panel phase, judge phase, and synthesis phase]

Fusion operates in three sequential phases, documented in OpenRouter's plugin guide:

Panel Phase. Your prompt goes out to up to 8 models in parallel. Each model has access to web search and web fetch tools, so they can ground their responses in real-time data. The default Quality preset sends to Fable 5 + GPT-5.5; the Budget preset uses Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro.

Judge Phase. A designated judge model (Claude Opus by default) receives all panel responses and performs comparative analysis. It produces structured JSON categorizing: consensus points, contradictions, partial coverage areas, unique insights from individual models, and blind spots none of them addressed.

Synthesis Phase. Your primary model receives the judge's structured analysis to craft the final response. This is the answer you actually get back.
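The three phases above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenRouter's implementation: `ModelFn`, `fuse`, `stub`, and the JSON field names are all hypothetical, and the stub models simply return canned text so the flow can run offline.

```python
import asyncio
import json
from typing import Awaitable, Callable

# Hypothetical interface: an async function that maps a prompt to a text reply.
ModelFn = Callable[[str], Awaitable[str]]


async def fuse(prompt: str, panel: dict[str, ModelFn],
               judge: ModelFn, primary: ModelFn) -> str:
    # Panel phase: every panel model answers the same prompt concurrently.
    names = list(panel)
    replies = await asyncio.gather(*(panel[name](prompt) for name in names))
    panel_answers = dict(zip(names, replies))

    # Judge phase: one model compares the answers and returns structured JSON.
    analysis = json.loads(await judge(json.dumps(panel_answers)))

    # Synthesis phase: the primary model writes the final answer from the analysis.
    return await primary(f"Question: {prompt}\nAnalysis: {json.dumps(analysis)}")


def stub(reply: str) -> ModelFn:
    """Build a fake model that always returns the same reply (for demos only)."""
    async def call(_prompt: str) -> str:
        return reply
    return call


def demo() -> str:
    # Field names are assumed; they mirror the judge categories described above.
    judge_reply = json.dumps({
        "consensus": ["4"], "contradictions": [], "partial_coverage": [],
        "unique_insights": [], "blind_spots": [],
    })
    panel = {"model-a": stub("4"), "model-b": stub("four")}
    return asyncio.run(fuse("What is 2 + 2?", panel, stub(judge_reply), stub("4")))
```

The panel calls run concurrently, but the judge and synthesis steps each wait on the previous phase, which is why the pipeline cannot be faster than three sequential model calls.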

The critical detail for your bill: you pay for every underlying completion plus the judge call. A 3-model panel means roughly 4–5× the cost of a single completion on the same prompt. OpenRouter's pricing page confirms it: "your request is priced as the sum of those underlying completions."

Fusion pricing is cumulative: you pay for every underlying model completion plus the judge call. A Quality run costs 3.2× what a single Opus 4.8 call costs. Budget is the cost-efficient option, at 0.40× the cost of solo Fable 5.
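The cumulative pricing rule reduces to a sum. The sketch below uses an illustrative unit price of 1.0 per call to show where a "4–5×" figure for a 3-model panel could come from; `fusion_cost` is a hypothetical helper, and whether the synthesis step is billed as its own call is an assumption here, not something the pricing quote confirms.

```python
def fusion_cost(panel_costs: list[float], judge_cost: float,
                synthesis_cost: float = 0.0) -> float:
    """Total price of one Fusion request: the sum of its underlying calls."""
    return sum(panel_costs) + judge_cost + synthesis_cost


# Illustrative unit prices: every call costs 1.0 "single completion".
three_panel_plus_judge = fusion_cost([1.0, 1.0, 1.0], judge_cost=1.0)
# Assumption: the synthesis step is billed as one more completion.
with_synthesis_call = fusion_cost(
    [1.0, 1.0, 1.0], judge_cost=1.0, synthesis_cost=1.0
)
```

In practice the calls are not equally priced: each model has its own rate, and the judge sees every panel answer as input, so real multipliers depend on the models chosen and the token counts.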

The DRACO Numbers: What the Benchmark Says

OpenRouter's launch blog post leads with DRACO benchmark results — a research-task evaluation covering 100 complex queries.

[Figure: OpenRouter's official Fusion launch announcement, stating that the Budget panel matches Fable 5 at 50% cost and the Quality panel beats it by 3.7 DRACO points]

Here's the leaderboard:

[Figure: DRACO benchmark comparison chart showing Fusion Quality at 69.0% vs Solo Fable 5 at 65.3% vs Fusion Budget at 64.7%]
| Configuration | DRACO Score | Cost per Prompt (8K/2K) |
|---|---|---|
| Fusion Quality (Fable 5 + GPT-5.5) | 69.0% | $0.29 |
| Fusion Quality (Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro) | 68.3% | ~$0.25 |
| Claude Fable 5 (solo) | 65.3% | ~$0.10 |
| Fusion Budget (Gemini 3 Flash + Kimi + DeepSeek) | 64.7% | $0.04 |
| DeepSeek V4 Pro (solo) | 60.3% | ~$0.02 |
| GPT-5.5 (solo) | 60.0% | ~$0.06 |

The Quality preset does beat solo Fable 5 — by 3.7 percentage points. And the Budget preset comes within 0.6 points of Fable 5 at roughly 40% of the cost. Those numbers are real.
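Those two comparisons can be recomputed directly from the leaderboard. The snippet below is a small sketch using only the scores and per-prompt costs quoted in the table (some of which are approximate); `ROWS` and `versus_baseline` are names invented for illustration.

```python
# (DRACO score in %, cost per prompt in USD), copied from the leaderboard.
ROWS = {
    "Fusion Quality (Fable 5 + GPT-5.5)": (69.0, 0.29),
    "Claude Fable 5 (solo)": (65.3, 0.10),
    "Fusion Budget": (64.7, 0.04),
}


def versus_baseline(name: str,
                    baseline: str = "Claude Fable 5 (solo)") -> tuple[float, float]:
    """Return (score delta in points, cost as a multiple of the baseline)."""
    score, cost = ROWS[name]
    base_score, base_cost = ROWS[baseline]
    return round(score - base_score, 1), round(cost / base_cost, 2)


quality_vs_fable = versus_baseline("Fusion Quality (Fable 5 + GPT-5.5)")
budget_vs_fable = versus_baseline("Fusion Budget")
```

Read side by side, Quality buys +3.7 points at 2.9× the per-prompt cost, while Budget gives up 0.6 points at 0.4× the cost.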

But context matters. OpenRouter acknowledges several caveats in the fine print:

  • Fable 5 completed only 93 of 100 tasks due to content filters.
  • DRACO evaluates text-only, English-only interactions.
  • Absolute scores vary 10–25 points depending on which model you use as the judge.
  • Perhaps most importantly, Fusion showed "no advantage for long-horizon tasks, which is where Fable shines."

MindStudio's independent comparison found similar numbers: Fusion reaches 64.7% vs Fable 5's 65.3% on their test set, a gap of 0.6 percentage points. Close enough that most applications won't feel the difference — but you're paying for parity, not gaining from it.

The HN Reality Check: 7× Slower, 4× the Cost

[Figure: Hacker News discussion of the OpenRouter Fusion API (200 points, 78 comments), with key skeptical comments about cost and latency]

The Hacker News thread that collected 200 points and 78 comments tells a more sobering story than the benchmark deck.

The most upvoted technical comment came from a developer who'd built a similar fusion system: "Fusion was 7× slower and 4× the cost compared to calling Opus 4.7 or GPT 5.5 directly." Their conclusion: it's a "use it only when you need it" feature, not a default routing strategy.

The same commenter raised a deeper concern about the judge model approach: having one model judge another's response essentially asks "how closely does this resemble the answer you would have given me." Additional rounds of judging amount to "just cranking up the temperature" without delivering objectively better answers.

[Figure: top HN comment on Fusion: 7× slower, 4× the cost; the judge model just asks how closely the answer resembles its own]

HN community consensus: multi-model judging works well for verifiable answers (like resume tailoring or factual research) but performs poorly for ambiguous domains where there isn't a clear "right" answer to judge against.

Other HN commenters added nuance. One pointed out that effective results require explicit instructions separating truth evaluation from usefulness assessment — without careful prompt engineering for the judge, you get nitpicks rather than genuine quality improvements. Another noted that multi-model routing is strategic for verifiable domains but adds latency without improving outcomes for open-ended tasks.

Perhaps the most interesting technical observation came from a related thread: fusing identical models also boosted performance. That suggests the gains come primarily from additional test-time compute (more inference passes = more refined answers), not from model diversity. If true, Fusion's value proposition shifts from "blend the best models" to "spend more compute at inference time" — which you could do in other ways.

Budget vs Quality: Two Very Different Products

[Figure: annual cost comparison showing Quality Fusion at $34,800/year vs Solo Fable 5 at $12,000/year vs Budget Fusion at $4,800/year]

Buried in the pricing data is a critical distinction that OpenRouter's marketing glosses over. TokenMix's independent review breaks down the annual math:

Quality Fusion at 10K prompts/month costs approximately $34,800/year (120,000 prompts × $0.29). Solo Fable 5 at the same volume: $12,000/year. You're paying 2.9× as much for a 3.7 percentage point DRACO lift. That's ($34,800 − $12,000) / 3.7, or roughly $6,160 per percentage point per year.

Budget Fusion is the opposite story. At $0.04 per prompt, it costs roughly $4,800/year for the same volume — 60% less than solo Fable 5 while scoring within 0.6 points on DRACO. This is the actual "half the price" product that the marketing leads with.

These are two fundamentally different value propositions:

  • Budget Fusion is a genuine cost play: near-frontier performance from cheap models, boosted by the ensemble effect. If you're running high-volume batch tasks and 64.7% DRACO performance is acceptable, this is compelling.

  • Quality Fusion is a premium surcharge for the last 3.7 points of benchmark performance. It only makes economic sense when a better answer is worth more than the roughly $0.19 per-prompt premium ($0.29 vs. $0.10 for solo Fable 5), which limits it to high-stakes domains like legal analysis, compliance, or medical research.
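The annual figures above follow from the per-prompt prices and a volume of 10K prompts per month. Here is a short sketch of that arithmetic so you can substitute your own volume; `annual_cost_usd` is a hypothetical helper, and prices are passed in integer cents to keep the results exact.

```python
PROMPTS_PER_MONTH = 10_000


def annual_cost_usd(cents_per_prompt: int,
                    prompts_per_month: int = PROMPTS_PER_MONTH) -> float:
    # Work in integer cents to avoid floating-point drift, then convert to dollars.
    return cents_per_prompt * prompts_per_month * 12 / 100


quality = annual_cost_usd(29)   # Fusion Quality at $0.29 per prompt
solo = annual_cost_usd(10)      # solo Fable 5 at ~$0.10 per prompt
budget = annual_cost_usd(4)     # Fusion Budget at $0.04 per prompt

# Extra annual spend per DRACO point gained by moving from solo to Quality.
premium_per_point = (quality - solo) / 3.7

# Fraction saved by moving from solo Fable 5 to Budget.
budget_saving = 1 - budget / solo
```

With these inputs the totals come to $34,800, $12,000, and $4,800 per year, a premium of about $6,162 per DRACO point, and a 60% saving for Budget.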

When Fusion Earns Its Cost (and When It Doesn't)

Based on the benchmark data, community feedback, and pricing analysis, here's a practical decision framework:

Use Fusion Quality when:

  • Output value exceeds $1 per task (legal briefs, compliance reviews, high-stakes research)
  • You need demonstrable cross-model consensus for audit trails
  • The task has verifiable right answers that a judge model can meaningfully evaluate
  • Latency tolerance is 1–3 seconds (not real-time)

Use Fusion Budget when:

  • You're running high-volume batch processing where 65% DRACO-tier performance suffices
  • You want frontier-adjacent results without frontier pricing
  • Single-vendor risk matters more than raw speed (post-Fable-5-freeze hedging)

Skip Fusion entirely for:

  • Real-time interactive applications requiring sub-500ms response
  • Code completion, chat, and content generation (high-volume, latency-sensitive)
  • Long-horizon tasks where Fable 5 has a documented advantage Fusion can't match
  • Any workflow where you'd be paying 3× for a 3.7-point benchmark lift you can't monetize

A useful heuristic from the HN thread: if you can articulate why a single skilled human reviewer would consult three experts before answering, Fusion's panel model fits. If that's overkill, single-model is faster and cheaper.
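The framework above can be written as a small routing function. This is one possible encoding, not a definitive policy: the `Task` fields and route names are hypothetical, and the thresholds ($1 per task, sub-500ms) are taken from the bullets above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    value_usd: float        # what a correct answer is worth to you
    max_latency_ms: int     # how long the caller can wait
    verifiable: bool        # does the task have a checkable right answer?
    long_horizon: bool      # multi-step, long-running work
    batch: bool             # high-volume offline processing


def choose_route(task: Task) -> str:
    # Skip Fusion for real-time use and for long-horizon tasks.
    if task.max_latency_ms < 500 or task.long_horizon:
        return "single-model"
    # Quality: high-value tasks with answers a judge can meaningfully check.
    if task.value_usd > 1.0 and task.verifiable:
        return "fusion-quality"
    # Budget: high-volume batch work where ~65% DRACO-tier output suffices.
    if task.batch:
        return "fusion-budget"
    return "single-model"
```

For example, `choose_route(Task(5.0, 3000, True, False, False))` returns `"fusion-quality"`, while a chat request with a 300 ms budget falls through to `"single-model"` regardless of its value.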

The Bigger Picture: Test-Time Compute vs Model Diversity

Fusion didn't launch in a vacuum. The Fable 5 freeze exposed a structural vulnerability in every production stack that depends on a single frontier provider — and the market responded with a wave of multi-model tooling within days.

Fusion is not the only entrant. Andrew Ng's aisuite, trending on GitHub this week with +270 stars/day, takes a different approach: a unified API that lets you switch providers by changing a single provider:model string, without the ensemble overhead. It supplies the plumbing for multi-model strategies without forcing every request through a judge-and-synthesize pipeline. For teams that want provider portability without the latency and per-call cost of multi-model deliberation, it is a lighter-weight alternative.
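To make the portability idea concrete, here is a minimal sketch of provider fallback keyed on provider:model strings. It is not aisuite's API: `complete_with_fallback`, the `Provider` type, and the registry are hypothetical, and each provider is modeled as a plain function so the example runs without network access.

```python
from typing import Callable

# Hypothetical provider signature: (model name, prompt) -> reply text.
Provider = Callable[[str, str], str]


def complete_with_fallback(prompt: str, targets: list[str],
                           providers: dict[str, Provider]) -> tuple[str, str]:
    """Try each "provider:model" target in order; return (target, reply)."""
    errors = []
    for target in targets:
        provider_name, _, model = target.partition(":")
        try:
            return target, providers[provider_name](model, prompt)
        except Exception as exc:  # unknown provider, outage, frozen model, ...
            errors.append(f"{target}: {exc!r}")
    raise RuntimeError("all targets failed: " + "; ".join(errors))
```

Unlike a panel, this makes one successful call per request, so the cost and latency stay close to the single-model case; the trade-off is that you get a backup answer, not a blended one.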

The broader question Fusion surfaces is whether the future of inference quality lies in model diversity (different architectures compensating for each other's blind spots) or test-time compute (spending more inference passes on the same model). The HN finding — that fusing identical models also improves performance — suggests it might be the latter. If so, approaches like best-of-N sampling or extended thinking tokens could deliver similar quality gains without the complexity of a multi-model panel.
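Best-of-N sampling, mentioned above, is simple to sketch: draw several candidates from one model and keep the one a scoring function prefers. In the sketch below `generate` and `score` are hypothetical callables (for instance a sampled model call and a verifier or reward model); nothing here is specific to Fusion.

```python
from typing import Callable


def best_of_n(prompt: str, generate: Callable[[str], str],
              score: Callable[[str], float], n: int = 4) -> str:
    """Sample n candidates from one model and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

The cost scales with n just as a panel's does, and the quality of the result depends on how well `score` tracks correctness, which is the same weakness the HN thread raised about judge models.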

TheAIGRID's tutorial video walks through the practical setup, showing the quality-vs-budget mode tradeoff and how to track Fusion pricing per request. It's a good starting point if you want to test Fusion against your own workload before committing.

The Verdict for Operators

OpenRouter Fusion is a real product with real benchmark gains — not vaporware. The Quality preset genuinely beats solo Fable 5 on DRACO by 3.7 points. The Budget preset genuinely matches Fable 5 at 40% of the cost.

But the marketing framing, "Fable-level intelligence at half the price," obscures a critical split. Budget Fusion delivers on that promise for batch workloads. Quality Fusion costs roughly 2.9× as much as Fable 5 alone, making it a premium tier that only pencils out for high-value-per-task domains.

The Fable 5 freeze created a legitimate demand signal for vendor diversification. If your stack depends on a single frontier model that can get pulled overnight, Fusion's Budget preset is a reasonable hedge: spread your inference across three cheap models and get within 0.6 DRACO points of the frontier. That's a real operational benefit.

But if you're evaluating Fusion Quality as a default replacement for solo Fable 5 or Opus 4.8 in your production pipeline, the HN crowd has the right read: 7× slower, 4× the cost, and the judge layer adds complexity without proportional quality gains for most use cases. Use it surgically — for the high-stakes tasks where cross-model consensus matters — not as your everyday inference router.

The multi-model routing category is real and growing. But the first generation of products is still finding the line between "useful redundancy" and "expensive overhead." Fusion's Budget preset sits on the right side of that line for batch workloads. The Quality preset, for now, is an expensive bet that the ensemble effect can consistently outperform the models it's built from — and the benchmarks don't yet prove that case for the majority of production use cases.

About ComputeLeap Team

The ComputeLeap editorial team covers AI tools, agents, and products — helping readers discover and use artificial intelligence to work smarter.
