
Qwen3.6-35B on Mac: Setup Guide + Beats Claude Opus 4.7

Alibaba's Qwen3.6-35B-A3B runs on MacBook Pro via LM Studio. Only 3B active params, Apache 2.0. Setup guide + why it beat Claude Opus 4.7 locally.


ComputeLeap Team

MacBook Pro M5 with holographic MoE neural network showing only 3B of 35B parameters active

Alibaba released Qwen3.6-35B-A3B on April 16, 2026, and by that afternoon it had done something no open-source model had managed cleanly before: running on a MacBook Pro M5 and outperforming Claude Opus 4.7 on a benchmark test. The model is a sparse Mixture-of-Experts (MoE) architecture with 35 billion total parameters but only 3 billion active per forward pass, under an Apache 2.0 license with full commercial use permitted. Developer Simon Willison tested it locally via LM Studio using a 20.9GB quantized build, and gave the win to Qwen—Opus 4.7 "managed to mess up the bicycle frame."

@Alibaba_Qwen tweet announcing Qwen3.6-35B-A3B open source release — Apache 2.0, 3B active params

This guide is for developers who saw that HN post (#3 with 973 points), thought "I want to run this on my Mac," and want a clear path from download to first inference. We'll cover what the MoE architecture actually means for your hardware, how to install via LM Studio in under 10 minutes, what real-world performance looks like on M-series Macs, and when to reach for this model over a hosted API.

What Sparse MoE Means (Without the Jargon)

Most AI models are "dense"—every parameter fires on every token. A 27B dense model activates all 27 billion weights every single time it predicts the next word. That's why large dense models are slow and expensive to run locally: the compute scales linearly with parameter count.

Qwen3.6-35B-A3B is different. It uses a Mixture of Experts (MoE) architecture with 256 total experts, but only 8 routed + 1 shared expert activate per token. In practice: the model has 35B parameters loaded in memory, but for each inference pass it only computes through roughly 3 billion of them.

Diagram: Expert Router selects 8 of 256 experts per token, plus 1 shared expert — most experts inactive (dark), active ones highlighted in amber

Here's the practical implication: inference speed is determined by active parameters, not total parameters. A model with 3B active parameters generates tokens roughly as fast as a 3B dense model, even though it "knows" as much as a 35B one. The memory footprint reflects total size (you still load all 35B weights), but the speed reflects the active slice.
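The routing step can be sketched in a few lines of Python. This is a toy illustration of top-k gating with the expert counts from the model card, not Qwen's actual router implementation:

```python
import math
import random

TOTAL_EXPERTS = 256   # routed experts per MoE layer (from the model card)
ROUTED_TOP_K = 8      # routed experts activated per token
SHARED_EXPERTS = 1    # always-on shared expert

def route_token(router_logits):
    """Pick the top-k routed experts for one token and softmax their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:ROUTED_TOP_K]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(TOTAL_EXPERTS)]
active = route_token(logits)

# Only 8 of 256 routed experts (plus the shared expert) do any compute:
print(len(active) + SHARED_EXPERTS, "of", TOTAL_EXPERTS + SHARED_EXPERTS, "experts active")
```

Every other expert's weights sit idle in memory for that token, which is exactly why speed tracks the active slice while RAM tracks the total.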

This is why community benchmarks show:

  • RTX 4090: 40 tokens/sec prefill, 20 tokens/sec decode
  • MacBook Pro M1 Max (64GB): 140 tokens/sec on Q4 quantization
  • MacBook Pro M3/M4/M5 (32GB): Comfortably runs the Q4_K_S build at interactive speeds

For comparison: a dense 30B model on the same hardware would generate tokens at roughly 1/10th that speed. MoE is how Alibaba delivered 30B-equivalent intelligence at 3B-equivalent inference cost.

Qwen3.6-35B-A3B needs ~22GB RAM for the Q4 quantization. MacBook Pro M3/M4/M5 with 32GB+ runs it comfortably. M2 with 24GB works but leaves less headroom for the OS and KV cache.
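That footprint is easy to sanity-check: weight storage is roughly parameter count times bits per weight. A back-of-envelope sketch (the ~4.8 bits/weight average for a Q4-class quant is an assumption chosen to match the published file size):

```python
def quant_footprint_gb(total_params_b, bits_per_weight):
    """Rough weight-storage estimate: params × bits / 8, in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Assumed ~4.8 bits/weight average for a Q4_K-class mixed-precision quant
print(round(quant_footprint_gb(35, 4.8), 1))  # → 21.0, close to the 20.9GB build
```

Add a few GB on top for the OS and the KV cache, and the 24GB-minimum / 32GB-comfortable guidance follows directly.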

The Simon Willison Benchmark

Before we get to setup, it's worth understanding what "beat Claude Opus 4.7" actually means.

Simon Willison—author of sqlite-utils, co-creator of Django, and one of the most rigorous model benchmarkers in the space—ran his standard "pelican riding a bicycle" SVG generation test. This benchmark has been running since GPT-4, and while Willison himself calls it "mainly a statement on how obtuse and absurd the task of comparing these models is," he notes there's historically been a correlation between pelican quality and general model usefulness.

The results: Qwen3.6-35B-A3B produced an SVG with a correct bicycle frame and creative details (clouds in the sky). Claude Opus 4.7—tested at both standard and maximum thinking mode—"managed to mess up the bicycle frame." Willison ran a second test (flamingo riding a unicycle) and awarded that to Qwen as well: "creative touches like sunglasses and a bowtie" versus Opus's "competent if slightly dull vector illustration."

@mervenoyann tweet about Qwen3.6-35B-A3B — 1.2M views, amplified by @jeremyphoward

The hardware context matters here: Willison ran this on a MacBook Pro M5 using the Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantization (20.9GB) from Unsloth, via LM Studio. This is not a data center comparison. This is a laptop versus a frontier API model—and the laptop won on a creative reasoning task.

One important caveat from Willison himself: "I doubt the quantized model is genuinely more powerful overall—it simply excels at this specific creative task on local hardware." The benchmark is a proof-of-concept, not a comprehensive evaluation. For coding assistance, document analysis, or long-context work, Opus 4.7 with its full context window and cloud compute may still outperform a quantized local model.

What Qwen3.6-35B-A3B Actually Does Well

The model card benchmarks paint a clearer picture of where this model genuinely shines:

Benchmark             Qwen3.6-35B-A3B
SWE-bench Verified    73.4
AIME 2026             92.7
GPQA Diamond          86.0
MMMU (multimodal)     81.7
Terminal-Bench 2.0    51.5

The SWE-bench score (73.4) is the headline number—that's real-world software engineering task completion on GitHub issues. For a model that runs on your laptop, matching scores that frontier models delivered six months ago is remarkable.

Natively multimodal: Unlike many MoE models, Qwen3.6-35B-A3B includes a vision encoder. You can pass images, screenshots, and video frames directly. This is rare at this level of parameter efficiency.

Thinking/non-thinking mode: The model supports two inference modes—deliberate step-by-step reasoning (thinking mode) for complex problems, or fast direct responses (instruct mode) for quick tasks. You switch between them via a parameter, not by downloading a different model.

Context length: 262,144 tokens natively, extensible to 1,010,000 tokens via YaRN scaling. In practice on local hardware you're constrained by RAM, but even at Q4 quantization you can work with very long contexts compared to most local alternatives.

Agentic coding: The Qwen team specifically enhanced repository-level reasoning and frontend workflow handling. If you're using this as a Claude Code alternative for local agentic tasks, the 73.4 SWE-bench score is what you're buying.

Setup: Running Qwen3.6-35B-A3B on Mac via LM Studio

This is the path Willison used, and it's the easiest way to get the model running. Total time: under 10 minutes.

Prerequisites

  • Mac with Apple Silicon (M1/M2/M3/M4/M5)
  • At minimum 24GB unified memory (32GB recommended)
  • ~25GB free disk space
  • LM Studio (free, download from lmstudio.ai)

Step 1: Install LM Studio

Download LM Studio from lmstudio.ai. Install the .dmg and launch it. On Apple Silicon, LM Studio runs GGUF builds like this one through llama.cpp with native Metal GPU acceleration (it also supports MLX-format models), so no driver setup is required.

Step 2: Find Qwen3.6-35B-A3B

In LM Studio, click Discover in the left sidebar. Search for qwen3.6. You'll see multiple quantization variants:

Variant                 Size     Best For
Q4_K_S (recommended)    20.9GB   Best balance: quality + speed
Q4_K_XL                 22.4GB   Slightly better quality, needs 24GB+
IQ2_M                   ~12GB    16GB Mac, reduced quality
BF16 (full precision)   69.4GB   Mac Studio 192GB+

Use Unsloth's Q4_K_S quantization (20.9GB). It's the sweet spot between quality and memory footprint for most Mac setups. If you have 32GB+ RAM, try Q4_K_XL for marginal quality improvement.
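If you want to encode that decision, here is a hypothetical helper (suggest_quant is our name; the sizes come from the variant table, and the RAM thresholds are rough rules of thumb, not official requirements):

```python
# Variants ordered best-quality-first: (name, file size GB, suggested minimum RAM GB)
QUANTS = [
    ("BF16",    69.4, 96),
    ("Q4_K_XL", 22.4, 32),
    ("Q4_K_S",  20.9, 24),
    ("IQ2_M",   12.0, 16),
]

def suggest_quant(ram_gb):
    """Return the highest-quality variant whose rough RAM floor fits."""
    for name, size_gb, min_ram in QUANTS:
        if ram_gb >= min_ram:
            return name
    return None  # below 16GB, this model isn't a good fit

print(suggest_quant(32))  # → Q4_K_XL
print(suggest_quant(24))  # → Q4_K_S
```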

Click the download icon next to your chosen variant. LM Studio handles the rest. The download is ~21GB so expect 5–10 minutes depending on your connection.

Step 3: Load and Chat

Once downloaded, click Load Model and select your Qwen3.6-35B-A3B build. Switch to the Chat tab. First inference takes a few seconds to warm up the model layers; subsequent responses are fast.

For coding tasks, the model performs best with a system prompt like:

You are an expert software engineer. Think through problems step by step before responding.

Step 4: Use the Local API

LM Studio exposes an OpenAI-compatible local API on http://localhost:1234/v1. Any tool that supports OpenAI API format—Claude Code via --openai-base-url, Cursor, Aider, Continue.dev—can talk to it.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # any string works
)

response = client.chat.completions.create(
    model="qwen3.6-35b-a3b-ud-q4_k_s",
    messages=[{"role": "user", "content": "Refactor this function for readability..."}],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)

For thinking mode (slower, more deliberate), set temperature=1.0 and presence_penalty=1.5. For fast instruct mode: temperature=0.7, top_p=0.8.
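A small helper makes switching between the two presets explicit (sampling_params is a hypothetical name; the values are the ones above):

```python
def sampling_params(thinking: bool) -> dict:
    """Return the sampling settings this guide suggests for each mode."""
    if thinking:
        # deliberate step-by-step reasoning: hotter, with repetition penalized
        return {"temperature": 1.0, "presence_penalty": 1.5}
    # fast instruct mode
    return {"temperature": 0.7, "top_p": 0.8}

# Usage with the client from the previous snippet:
# client.chat.completions.create(model=..., messages=..., **sampling_params(thinking=True))
```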

Alternative: Ollama

If you prefer a command-line workflow:

ollama run qwen3.6:35b-a3b

Ollama handles quantization selection automatically. Less control over specific quants, but zero friction to get started.

Wait a week before benchmarking day-zero quantizations. Unsloth re-released the Qwen3.6 GGUFs multiple times on launch day due to llama.cpp bug fixes. The model works fine for regular use—just don't base evaluation decisions on day-1 performance numbers.

Real-World Performance on Apple Silicon

Based on community testing from the HN thread (973 points, 433 comments) and r/LocalLLaMA:

MacBook Pro M1 Max (64GB):

  • First run: under 90 seconds
  • Q4 quantization: ~140 tokens/sec generation
  • Comfortable for long-context agentic tasks

MacBook Pro M3/M4 Pro (36GB):

  • Q4_K_S runs with ~12GB headroom for OS and KV cache
  • Interactive speeds for chat and coding assistance
  • Long documents (100K+ tokens) get slower but manageable

MacBook Pro M2 (24GB):

  • Q4_K_S works but leaves minimal headroom
  • Stick to shorter contexts (<32K tokens) to avoid swapping
  • Consider IQ2_M quantization for more comfortable operation

RTX 4090 (Windows/Linux):

  • 40 tokens/sec prefill, 20 tokens/sec decode
  • Full GPU inference without CPU offloading

The MoE architecture has one specific advantage for constrained systems: CPU offloading. Because inactive expert layers don't participate in computation, tools like KTransformers can offload them to system RAM while keeping active experts on GPU VRAM, making the model viable on 16GB VRAM + 64GB system RAM configurations where a dense 35B would be completely infeasible.
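A back-of-envelope memory split, under the simplifying assumption that only the ~3B active-path weights need to live in VRAM while the inactive experts sit in system RAM (real engines like KTransformers make finer-grained placement decisions, keeping attention layers and hot experts on GPU):

```python
# Assumed ~4.8 bits/weight average for a Q4-class quant (same assumption as earlier)
def gb(params_billion, bits=4.8):
    """Weight storage in GB for a given parameter count."""
    return params_billion * bits / 8

total_b, active_b = 35, 3
vram_gb = gb(active_b)           # active path kept on GPU
ram_gb = gb(total_b - active_b)  # inactive experts offloaded to system RAM

print(f"~{vram_gb:.1f} GB VRAM + ~{ram_gb:.1f} GB system RAM")
```

Even with generous overhead for attention layers and the KV cache, the split shows why a 16GB VRAM + 64GB RAM box can host this model when a dense 35B could not.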

When to Use Qwen3.6-35B-A3B vs. Claude/GPT API

This model earns its place in your toolkit for specific use cases, not as a wholesale Claude replacement.

Use Qwen3.6-35B-A3B locally when:

  • You need coding assistance offline — airplane, spotty internet, corporate VPN restrictions
  • You're building agentic pipelines where API costs add up quickly (code review loops, test generation, documentation)
  • You have data privacy requirements — no data leaves your machine
  • You want fast iteration on prompts without API latency or cost
  • You're doing long-running autonomous tasks where cloud API timeouts are an issue

Stick with Claude Opus 4.7 / GPT when:

  • You need the absolute best reasoning quality for complex multi-step analysis
  • You're working with massive context (the quantized local model degrades more at very long contexts than the cloud APIs)
  • Speed at scale matters (Opus 4.7 via API is faster than local inference for most Mac users)
  • You want vision tasks at production quality — cloud models have more tuned vision pipelines
  • You need guaranteed uptime for customer-facing applications

The honest framing: Qwen3.6-35B-A3B is the best local alternative to a frontier model that has ever existed. It's not better than Opus 4.7 on average—but it's close enough that the local advantage (privacy, cost, latency, offline capability) tips the balance for developer workflows.

What the Community Is Saying

The HN thread hit #3 with 973 points and 433 comments within hours of release—strong signal for a model launch.

Hacker News thread: Qwen3.6-35B-A3B — #3 with 973 points and 433 comments

The community reaction splits into two camps. Enthusiasts noted the benchmark performance and efficiency gains immediately. One commenter summarized the MoE advantage clearly: "The 35B and 27B models work on a 22GB Mac / RAM device. The 35B-A3B model only activates 3 billion parameters per token, but all 35 billion parameters must be loaded into memory. The speed benefit is real — inference only computes over the active portion — but the memory footprint reflects the total model size."

On the skeptical side, Unsloth's Daniel Chen acknowledged multiple re-releases: "Wait a week before relying on day-zero quantizations for evaluation." This is good advice—the model works for regular use immediately, but benchmark comparisons should wait for stable builds.

The @mervenoyann tweet announcing the release reached 1.2M views, with @jeremyphoward (fast.ai) amplifying—strong signal that the practical ML community is paying attention. The Apache 2.0 license specifically attracted comments from developers who had concerns about Llama's custom license for commercial products.

Simon Willison's benchmark post—"Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7"—generated its own HN thread with 347 points and 75 comments, separating the community reaction to the local performance story from the general release discussion.

Hacker News: Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 — 347 points, 75 comments

The broader signal: "Chinese labs are winning the open-source race" on efficiency. Qwen3.5 was already strong; Qwen3.6 extends the lead. The open-source efficiency gap with frontier closed models is shrinking faster than the market expected.

The Apache 2.0 License: Why It Matters

This sounds like a footnote but it isn't.

Llama models use a custom Meta license that restricts certain commercial uses (notably, services with >700M monthly active users can't use it). Many developers building products had to carefully read the fine print.

Qwen3.6-35B-A3B is Apache 2.0. Full commercial use. No restrictions. You can build products on it, ship it embedded in applications, fine-tune and redistribute—without asking permission or checking MAU counts. For commercial developers exploring local AI, this removes the last friction point.

Thinking Mode and Agentic Coding Features

Two Qwen3.6 capabilities deserve special mention for developers building agentic systems:

Dual-mode inference: The model supports enable_thinking: True/False at the API level. Thinking mode adds a scratchpad reasoning step before the final response—slower but more accurate for complex problems. Instruct mode gives fast direct answers. You don't load a different model; you flip a parameter. This is genuinely useful for agentic pipelines where some tasks need deliberation (code review, architecture decisions) and others need speed (file listing, simple transforms).

Thinking Preservation (preserve_thinking: True): For iterative workflows, the model can retain its reasoning context from previous messages. This means in a multi-turn coding session, the model's "thought process" carries forward—it builds on previous reasoning rather than starting fresh each turn. For repository-level refactoring tasks or debugging sessions, this matters.

# Enable thinking preservation for iterative coding sessions
response = client.chat.completions.create(
    model="qwen3.6-35b-a3b-ud-q4_k_s",
    messages=conversation_history,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True,
            "preserve_thinking": True
        }
    }
)

Comparison: Qwen3.6-35B-A3B vs. Similar Local Models

Model               Active Params   License         Mac-Ready   Context
Qwen3.6-35B-A3B     3B              Apache 2.0      Yes         262K
Gemma 4-31B         31B             Gemma License   Yes         128K
Llama 3.3-70B       70B             Llama Custom    Partial     128K
Mistral-Small-27B   27B             Apache 2.0      Yes         32K

For Mac users specifically, Qwen3.6-35B-A3B wins on: active parameter efficiency (enables faster inference at lower memory pressure), context length (262K vs. 128K), Apache 2.0 license, and native multimodal capability.

The closest prior comparison is our Gemma 4 local setup guide from April 6. Gemma 4 at 27B remains a strong choice—especially if you're already using Google's ecosystem—but Qwen3.6's MoE efficiency advantage means faster tokens at the same memory footprint. For a broader look at when local beats cloud on cost and latency, see our full local AI guide for 2026.

Final Verdict

Qwen3.6-35B-A3B is the most interesting open-source model release of 2026 so far for Mac developers. The efficiency story is real: 3B active parameters delivering 30B-equivalent task performance, running at interactive speeds on hardware you already own, under a license that doesn't require a legal review.

Is it better than Claude Opus 4.7 for everything? No. For complex reasoning on long documents or production-grade vision tasks, the cloud APIs maintain an edge. But for offline coding assistance, privacy-sensitive applications, agentic workflows where API costs scale, and developers who simply want to run a frontier-class model locally—this is the one to download today.

The setup is under 10 minutes. The performance is genuinely surprising. And if your benchmark is "draw me a pelican riding a bicycle in SVG," it already won.


Want to compare how local models stack up against hosted APIs for developer workflows? See our Anthropic vs. OpenAI API developer guide and our local AI models with Unsloth & DGX Spark guide.


About ComputeLeap Team

The ComputeLeap editorial team covers AI tools, agents, and products — helping readers discover and use artificial intelligence to work smarter.
