China's Coding AI Is Closing the Gap Fast

[Image: MiniMax M3 coding model benchmark comparison, China's open-weights frontier]

On June 1, MiniMax released M3, an open-weights coding model that scores a vendor-reported 59% on SWE-Bench Pro, edging out GPT-5.5 and Gemini 3.1 Pro. It supports a million-token context window, handles image and video input natively, and costs roughly 5-10% of what Western frontier models charge. It's open-weights. And it's out of China.

This isn't a benchmark novelty. It's the latest salvo in a pricing war that's collapsing the economics of AI-assisted coding — and the implications run deeper than any leaderboard position.

The 18-Day Wave That Changed the Math

To understand why M3 matters, zoom out two months. Between April 7 and April 24, four Chinese AI labs shipped competing open-weight coding models in an 18-day stretch:

Model	Lab	SWE-Bench Pro	Output Cost (USD per 1M tokens)	Output Cost as % of Opus 4.7
GLM-5.1	Z.ai	58.4%	$3.50	14%
MiniMax M2.7	MiniMax	56.2%	$1.20	5%
Kimi K2.6	Moonshot AI	58.6%	$2.50	10%
DeepSeek V4-Flash	DeepSeek	55.4%	$0.28	1.1%

For context, Claude Opus 4.7 charges $25 per million output tokens and leads the publicly verified SWE-Bench Pro leaderboard at 64.3%. Every model in that April wave delivered competitive coding performance at a fraction of that cost. Kimi K2.6 tied GPT-5.5 at 58.6% while charging one-tenth of Opus 4.7's output price.
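The percentage column in the table is simple division against the Opus 4.7 output price. A minimal sketch that reproduces it, using only the prices quoted above (the function and variable names are ours, chosen for illustration):

```python
# Output prices in USD per 1M tokens, copied from the table above.
OPUS_47_OUTPUT_PRICE = 25.00

APRIL_WAVE = {
    "GLM-5.1": 3.50,
    "MiniMax M2.7": 1.20,
    "Kimi K2.6": 2.50,
    "DeepSeek V4-Flash": 0.28,
}


def share_of_opus(price: float, baseline: float = OPUS_47_OUTPUT_PRICE) -> float:
    """Return a model's output price as a percentage of the baseline price."""
    return price / baseline * 100


for model, price in APRIL_WAVE.items():
    print(f"{model}: {share_of_opus(price):.1f}% of Opus 4.7 output pricing")
```

MiniMax M2.7 works out to 4.8%, which the table rounds to 5%. The comparison covers output tokens only; input pricing would shift the ratios.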

[Embedded tweet from TestingCatalog: MiniMax M3 scores 59% on SWE-Bench Pro, on par with GPT-5.5, 1M context window]

Then MiniMax came back with M3 just five weeks later. The cadence tells the story: this isn't a one-off release cycle. It's a drumbeat.

What MiniMax M3 Actually Brings

M3 is the first open-weights model to combine three frontier capabilities in a single architecture: frontier-level coding, a million-token context window, and native multimodality.

The coding numbers:

SWE-Bench Pro: 59.0% (surpasses GPT-5.5's 58.6%, approaches Opus 4.7's 64.3%)
Terminal-Bench 2.1: 66.0%
MCP Atlas: 74.2%
BrowseComp: 83.5% (beats Opus 4.7's 79.3% on autonomous web tasks)

[Embedded tweet from MiniMax: Introducing M3: 59.0% SWE-Bench Pro, 66.0% Terminal Bench, 1M context via Sparse Attention]

The context window runs on MiniMax Sparse Attention (MSA), a new architecture that reduces per-token compute to one-twentieth of MiniMax's previous generation at million-token scale. Prefilling runs 9x faster; decoding runs 15x faster. The GPU implementation clocks 4x faster than competing open-source sparse attention methods.
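MiniMax has not been quoted here on how MSA chooses which tokens to attend to, so the following is not MSA. It is a generic top-k sparse attention sketch in pure Python, included only to show the general idea behind sparse methods: each query aggregates over a small subset of keys instead of the whole context. All function names are illustrative.

```python
import math


def _softmax_weighted_sum(scores, values):
    """Softmax over scores, then the weighted average of the value vectors."""
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def _scaled_scores(query, keys):
    """Scaled dot-product score of one query against every key."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def dense_attention(query, keys, values):
    """Standard attention: the query attends to every position."""
    return _softmax_weighted_sum(_scaled_scores(query, keys), values)


def topk_sparse_attention(query, keys, values, k):
    """Generic top-k sparse attention: keep only the k best-scoring positions.

    Note: this toy version still scores every key before selecting, so only the
    softmax and value aggregation shrink. Production methods use a cheaper
    selection step; that part is not modeled here.
    """
    scores = _scaled_scores(query, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return _softmax_weighted_sum([scores[i] for i in top], [values[i] for i in top])
```

With `k` equal to the number of keys, the sparse version reduces to dense attention; with a small `k`, the aggregation cost stops growing with context length, which is the property that makes million-token windows tractable.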

The multimodal capability isn't bolted on — M3 trained on approximately 100 trillion tokens of interleaved text-and-image data from inception. It can operate a desktop computer through visual input, which powers MiniMax's MCode desktop agent — a product that promises 24/7 autonomous task execution across applications, files, and systems.

WARNING

A critical caveat: all M3 benchmark scores are vendor-reported. As TechTimes noted, "every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked." Independent scores from Artificial Analysis and LMArena were still pending at launch. And MiniMax's comparison cherry-picks Opus 4.7 — the newer Opus 4.8 leads M3 by 10+ points on SWE-Bench Pro (69.2% vs 59.0%).

The Collision Course: Monetize vs. Commoditize

Here's the story the benchmarks don't tell you.

Right now, two forces are pulling the AI coding market in opposite directions.

The Western stack is monetizing. GitHub Copilot wraps MAI-Code-1-Flash, Microsoft's first in-house coding model, announced at Build 2026, inside a premium subscription. Anthropic charges $25/M output for Opus. OpenAI gates its best coding performance behind enterprise tiers. The logic: coding agents are the first proof that a huge market will pay a premium for closed models.

The Chinese stack is commoditizing. MiniMax, DeepSeek, Qwen, Moonshot, and Z.ai are shipping open-weights models on a near-weekly cadence, each one undercutting the last on price while closing the capability gap. Qwen3.7 Plus delivers multimodal agentic coding at $0.40 per million input tokens — a sixth of Qwen's own Max variant, and a rounding error compared to Opus. DeepSeek V4-Flash hits $0.28 per million output tokens. That's 1.1% of Opus pricing.

These aren't separate trends. They're the same market, pulling apart.

The Western premium stack needs the capability gap to justify its price. The Chinese open-weights stack needs to close that gap to justify its existence. Both are succeeding — which means the collision is getting closer, not further away.

The Benchmark Gap Is Real — But Shrinking

Let's be precise about where things stand.

Claude Opus 4.8 still leads the frontier. On SWE-Bench Pro, it scores 69.2%, which is 10.2 points ahead of M3's 59.0%. On Terminal-Bench 2.1, the gap is 8.6 points (74.6% vs. 66.0%). On OSWorld-Verified, it's 13.4 points.

But the trajectory matters more than the snapshot.

Six months ago, the best Chinese open-weights coding model scored in the low 40s on SWE-Bench Pro. Today, multiple Chinese models cluster between 55% and 60%. The gap contracted from 20+ points to roughly 10 in half a year. If that rate holds, and several labs are now pushing it, open-weights models will cross the "good enough" threshold well before they catch the frontier.
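"If that rate holds" can mean two different things, and they lead to different places. The sketch below extrapolates the gap under both readings, taking the rounded figures above (20 points six months ago, 10 today) as inputs. It is back-of-envelope arithmetic, not a forecast, and it ignores the fact that the frontier keeps moving too.

```python
GAP_SIX_MONTHS_AGO = 20.0  # points, rounded from "20+" in the text
GAP_TODAY = 10.0           # points, rounded from "roughly 10"
PERIOD_MONTHS = 6.0


def linear_gap(months_from_now: float) -> float:
    """Gap if it keeps shrinking by the same number of points per month."""
    slope = (GAP_TODAY - GAP_SIX_MONTHS_AGO) / PERIOD_MONTHS
    return max(0.0, GAP_TODAY + slope * months_from_now)


def proportional_gap(months_from_now: float) -> float:
    """Gap if it keeps shrinking by the same fraction per period (halving)."""
    ratio = GAP_TODAY / GAP_SIX_MONTHS_AGO
    return GAP_TODAY * ratio ** (months_from_now / PERIOD_MONTHS)


for months in (3, 6, 12):
    print(f"+{months} months: linear {linear_gap(months):.1f}, "
          f"proportional {proportional_gap(months):.1f}")
```

Under the linear reading the gap closes in about six months; under the proportional reading it halves each period and never fully closes. Either way it shrinks, which is the point the cost argument depends on.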

And for production deployments, "good enough at a tenth of the cost" often wins over "best at any price." Most agentic coding workflows run dozens of model calls per task. A 10x cost reduction doesn't just save money — it changes which workflows are economically viable in the first place.
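To make the multiplier concrete, here is a minimal sketch of output-token cost per agentic task. The two prices come from the article ($25/M for Opus-class output, $2.50/M for a model at one-tenth of that); the call count, tokens per call, and monthly task volume are hypothetical placeholders, and input-token cost is ignored for simplicity.

```python
def output_cost_per_task(price_per_million_usd: float,
                         calls_per_task: int,
                         output_tokens_per_call: int) -> float:
    """Output-token cost in USD for one task (input tokens are ignored)."""
    total_output_tokens = calls_per_task * output_tokens_per_call
    return total_output_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload, chosen only for illustration.
CALLS_PER_TASK = 40
OUTPUT_TOKENS_PER_CALL = 2_000
TASKS_PER_MONTH = 1_000

for label, price in [("Opus-class at $25/M", 25.00), ("one-tenth at $2.50/M", 2.50)]:
    per_task = output_cost_per_task(price, CALLS_PER_TASK, OUTPUT_TOKENS_PER_CALL)
    monthly = per_task * TASKS_PER_MONTH
    print(f"{label}: ${per_task:.2f} per task, ${monthly:,.2f} per month")
```

With these assumed volumes, a task costs $2.00 on the premium model and $0.20 on the cheaper one. A workflow that retries each task several times or fans out to parallel agents multiplies both numbers, which is where the cheaper tier starts to decide what gets built at all.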

[Embedded tweet from OpenRouter: MiniMax M3 live: frontier-class open-weight model with 1M context, coding and agentic performance]

The Qwen Factor

Alibaba's Qwen team deserves special attention. They're not just shipping one model — they're shipping an ecosystem.

Qwen3.7 Max hits 60.6% on SWE-Bench Pro, making it the strongest Chinese model on that benchmark. Days later, Qwen3.7 Plus added multimodal input — text, image, video — at a sixth of Max's price, keeping the same million-token context and agentic backbone.

Meanwhile, Qwen3-Coder-Next runs an 80-billion-parameter MoE architecture that activates only 3 billion parameters per token. It scores 70.6% on SWE-Bench Verified with SWE-Agent, competitive with models 10-20x larger in active parameters. It's open-weights, runs on consumer hardware, and is already the reference model that other open-source coding tools benchmark against.

The HN thread on MAI-Code-1-Flash — Microsoft's 5B-parameter coding model announced at Build — benchmarks it against Qwen3.6-35B at 49.5%. That's the tell: even Microsoft's own community reaches for the Chinese open-weight tier as the baseline.

MCode: When the Model Becomes a Product

MiniMax isn't just shipping a model with M3. They're shipping a desktop agent called MCode that turns the model into a 24/7 automation system.

MCode installs on Mac or Windows, works with local files, supports scheduled automations, and can route tasks across multiple specialized agents in parallel. Thanks to M3's native multimodal capabilities, it can operate across applications — opening ERP clients, batch-entering invoices from spreadsheets, monitoring competitor pricing — all without human intervention.

This matters because it shows the Chinese strategy isn't just about cheaper models. It's about building the product layer that captures the value that cheaper models create. Open-weights at the model layer, proprietary product at the application layer — the classic commoditize-your-complement play.

What the Open-Weights Label Actually Means

One important distinction: M3 is open-weights, not open-source. MiniMax released the trained parameters but not the training code or inference operators. You can use the model, fine-tune it, deploy it — but you can't fully reproduce or modify the training pipeline.

This matters for the moat argument. True open-source (like DeepSeek's approach) lets anyone rebuild the model from scratch. Open-weights gives you the finished artifact without the recipe. MiniMax is betting that the model weights are enough to capture developer adoption while the training infrastructure remains a proprietary competitive advantage.

The model is already available on OpenRouter, Ollama Cloud, and multiple other platforms. Weights are coming to Hugging Face within days of launch. The distribution strategy mirrors that of earlier Chinese releases: get the model into as many developer hands as possible, as fast as possible.

The MiniMax IPO Context

There's a business angle here too. MiniMax is preparing for dual listings — Hong Kong and Shanghai's Star Market. M3 is their first major product launch since formally beginning IPO preparations.

[Embedded tweet from Ollama: MiniMax M3 available on Ollama Cloud, US-based with zero data retention, for coding and agentic tasks]

That context explains the aggressive benchmark positioning, the comparison against Opus 4.7 rather than 4.8, and the rapid launch cadence. MiniMax needs to prove it can compete at the frontier to justify its valuation. The fact that they can credibly make that case with an open-weights model — while Western labs charge 10-20x more for closed alternatives — is itself the market signal.

What This Means for Developers

If you're building AI-powered coding tools or agentic workflows, the practical implications are straightforward:

The cost floor just dropped again. M3 at MiniMax's token plan pricing ($20-120/month for billions of tokens) makes million-token-context coding agents economically viable for individual developers and small teams. Workflows that were cost-prohibitive with Opus pricing are now table stakes.

The multi-model future is here. The optimal stack is increasingly a blend: frontier closed models for the hardest tasks, Chinese open-weights for high-volume agentic work, and tiny specialized models (like MAI-Code-1-Flash at 5B parameters) for latency-sensitive autocomplete. No single provider wins every use case.
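One way to picture that blend is a small routing function that sends each request to a tier based on latency budget and difficulty. This is a sketch under assumptions: the tier names, the `Task` fields, the difficulty score, and both thresholds are hypothetical, and a real router would also weigh cost caps, context length, and data-residency rules.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FRONTIER_CLOSED = "frontier-closed"  # hardest tasks, highest price
    OPEN_WEIGHTS = "open-weights"        # high-volume agentic work
    SMALL_LOCAL = "small-local"          # latency-sensitive autocomplete


@dataclass(frozen=True)
class Task:
    kind: str               # e.g. "autocomplete", "agent_step", "design_review"
    latency_budget_ms: int  # how long the caller is willing to wait
    difficulty: float       # 0.0 (trivial) to 1.0 (hardest); hypothetical score


def pick_tier(task: Task,
              hard_threshold: float = 0.8,
              fast_threshold_ms: int = 300) -> Tier:
    """Route a task to a model tier; latency constraints win over difficulty."""
    if task.latency_budget_ms <= fast_threshold_ms:
        return Tier.SMALL_LOCAL
    if task.difficulty >= hard_threshold:
        return Tier.FRONTIER_CLOSED
    return Tier.OPEN_WEIGHTS


if __name__ == "__main__":
    for task in (
        Task("autocomplete", latency_budget_ms=150, difficulty=0.2),
        Task("agent_step", latency_budget_ms=30_000, difficulty=0.5),
        Task("design_review", latency_budget_ms=120_000, difficulty=0.9),
    ):
        print(f"{task.kind} -> {pick_tier(task).value}")
```

Checking latency first is a deliberate choice in this sketch: an autocomplete request that misses its budget is useless no matter how good the answer is, so the small model gets those requests even when they look hard.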

Watch the independent benchmarks. M3's vendor-reported numbers are promising but unverified. Wait for Artificial Analysis and LMArena scores before making production deployment decisions. The April wave models have had time to be independently verified — M3 hasn't.

The moat is moving. If you're betting your product strategy on a model capability gap that exists today, you're building on a narrowing foundation. The hidden costs of cheap models are real — quality variance, support gaps, compliance questions — but they're getting smaller with each release. For a broader look at how these models compare in practice, see our coding assistant comparison and the DeepSeek V4 breakdown.

The Bottom Line

The West is building the most capable coding AI. China is building the most accessible. Both are right — and both strategies work, for now.

But when the capability gap between a $25/M-token model and a $1.20/M-token model narrows from 20 points to 10, the economics start doing the talking. MiniMax M3 isn't the model that closes the gap. It's the model that makes the gap's closure feel inevitable.

The coding moat hasn't fallen yet. But the water level is rising, and it's rising fast.


ComputeLeap Team
