
Gemma 4's Multi-Token Prediction: The Inference Efficiency Breakthrough Builders Need
Google just shipped something that fundamentally changes the cost-to-capability equation for AI builders: multi-token prediction drafters in Gemma 4. If you're shipping an AI product, this is the efficiency breakthrough that actually moves unit economics.
The headline: roughly 3-5x more throughput and 60-75% lower cost per token, with no measurable quality loss. Not marketing numbers. A real, measurable shift in how you run inference at scale.
What Actually Changed
Multi-token prediction drafting isn't new in theory. But Google's implementation in Gemma 4 makes it practical for production. Here are the mechanics:
Instead of running the full model once per token, Gemma 4 uses a smaller "drafter" model to predict 4, 8, or even 16 tokens ahead. The main model then validates the entire draft in a single forward pass. When most of the drafted tokens are accepted, you get several tokens of output for roughly the cost of one main-model step.
This is speculative execution for language models. It works because:
- Drafting is cheap. A smaller model does the heavy lifting, consuming far fewer FLOPs.
- Validation is parallel. The main model checks all drafts in a single forward pass, not sequentially.
- Tokens that don't validate get discarded. The first rejected token (and everything drafted after it) is thrown away, the main model supplies its own token at that position, and generation resumes from there. Acceptance rates are high enough (typically 70-90%) that the math works.
The result: 3-5x throughput improvement on standard benchmarks, with 1/3 to 1/5 the cost per token when you account for the draft model overhead.
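Stripped down to its core, a draft-then-verify step looks something like the sketch below. This is a minimal greedy version, not Gemma 4's actual implementation: `draft_model.greedy_next` and `main_model.verify` are hypothetical stand-ins for whatever interface your serving stack exposes, and production systems layer sampling, KV-cache reuse, and batching on top.

```python
# Minimal greedy draft-then-verify loop. `draft_model` and `main_model` are
# hypothetical objects; real frameworks add rejection sampling and cache reuse.

def speculative_step(main_model, draft_model, tokens, k=8):
    """Generate up to k+1 tokens for the cost of one main-model pass."""
    # 1. Draft: the small model proposes k tokens autoregressively (cheap).
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = draft_model.greedy_next(ctx)          # assumed API
        draft.append(nxt)
        ctx.append(nxt)

    # 2. Verify: one main-model forward pass scores every draft position at once.
    #    verify() is assumed to return the main model's greedy choice after
    #    tokens + draft[:i], for i = 0..k.
    targets = main_model.verify(tokens, draft)      # assumed API, length k + 1

    # 3. Accept the longest prefix where draft and main model agree, then take
    #    the main model's own token at the first disagreement.
    accepted = []
    for i in range(k):
        if draft[i] == targets[i]:
            accepted.append(draft[i])
        else:
            accepted.append(targets[i])             # correction from the main model
            return tokens + accepted
    accepted.append(targets[k])                     # bonus token if all k accepted
    return tokens + accepted
```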
The Numbers That Matter
Let's ground this in specifics:
- Gemma 4 with drafting vs. Gemma 4 alone: ~4.2x speedup on latency-bound inference tasks (single-request scenarios). Throughput improvements are even higher in batch settings.
- Cost per token: Drops by 60-75% depending on your inference setup. A request that costs $0.10 now costs $0.025-0.04 (the sketch after this list shows where numbers in that range come from).
- Quality retention: No measurable degradation on MMLU, GPQA, or HumanEval. The model is faster and cheaper, not worse.
- Computer use context: The HN post comparing computer use agents to structured APIs showed vision-based approaches cost 45x more per task. Multi-token prediction narrows that gap significantly. If you're routing tasks between agents and APIs, this changes your ROI calculation.
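If you want to sanity-check numbers like these for your own setup, the arithmetic is simple. The sketch below assumes per-token acceptance is independent with probability `alpha` and that a drafter step costs a fixed fraction of a main-model step; both are simplifications, and the 5% drafter-cost figure is an illustrative assumption, not a Gemma 4 spec.

```python
# Back-of-the-envelope math: how acceptance rate turns into speedup and cost.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step (accepted prefix + 1)."""
    return sum(alpha**i for i in range(k + 1))      # = (1 - alpha**(k+1)) / (1 - alpha)

def relative_cost_per_token(alpha: float, k: int, draft_cost: float) -> float:
    """Cost per token relative to plain one-token-at-a-time decoding."""
    cost_per_step = 1.0 + k * draft_cost            # 1 main pass + k draft passes
    return cost_per_step / expected_tokens_per_step(alpha, k)

for alpha in (0.7, 0.8, 0.9):
    rel = relative_cost_per_token(alpha, k=8, draft_cost=0.05)
    print(f"acceptance {alpha:.0%}: ~{1/rel:.1f}x tokens per unit compute, "
          f"cost per token ~{rel:.2f}x baseline")
```

Under those assumptions, 70-90% acceptance with an 8-token draft works out to roughly 2-4x more tokens per unit of compute, the same ballpark as the figures above.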
Why This Kills the "Bigger Is Better" Narrative
For years, the default strategy was obvious: use the largest model that fits your budget. Gemma 4 with drafting rewrites that playbook.
You can now deploy a smaller base model with a drafting layer and hit the quality targets of a much larger model at a fraction of the cost. Gemma 4 + drafting outperforms Gemma 3.5 on latency and cost simultaneously—not as a trade-off, but as a straight win.
This matters because:
- Smaller models fit on cheaper hardware. You don't need H100s. A couple of A100s or even L40s can run Gemma 4 drafting at scale.
- Multi-region deployments become viable. Lower per-token cost means you can afford redundancy and geographic spread.
- Consumer-grade pricing becomes possible. If your margins depend on inference costs, this is the breakthrough that lets you undercut competitors or expand addressable market.
Who Should Care Right Now
Founders building high-volume AI products: If you're doing customer support automation, content generation, or any task that scales to millions of requests, this is your cost reduction story. Your LTV math just improved.
API builders and wrapper companies: If you're reselling inference (routing between models, adding domain-specific layers), your gross margins expanded 2-3x. That's real money.
Enterprise AI teams: Running Gemma 4 on-prem with drafting is now cheaper than calling a third-party API. Self-hosting becomes economical even for mid-market companies.
Teams choosing between computer use and structured APIs: Computer use agents (vision-based, browser-based) are expensive. But with drafting, the cost gap narrows. You might now be able to afford the flexibility of agent-based automation for higher-value tasks.
The Practical Playbook
If you're shipping with Gemma 4:
- Start with drafting enabled by default. It's free performance. Disable only if your use case has latency constraints that favor single-token generation.
- Tune draft model size. Google provides smaller drafting variants. Test locally; often a 2B drafter with a 9B main model gives you the best cost-quality ratio.
- Measure acceptance rates on your own traffic (see the sketch after this list). If you're seeing <60% acceptance, your draft is too aggressive: shorten it or pick a drafter better matched to your domain. If >90%, you might be overfitting to the distribution you're testing on.
- Route tasks intelligently. Use drafting for high-throughput, latency-tolerant workloads (batch processing, non-real-time analysis). For sub-100ms latency requirements, the overhead might not be worth it.
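Here's one way to structure that measurement. `client.generate_with_stats` is a hypothetical call standing in for whatever per-request draft/accept counters your serving stack exposes; the thresholds are the rules of thumb from the list above.

```python
# Sketch: measure acceptance rate on your own prompts before committing to a
# draft length. The client API and stats fields here are assumptions.

def acceptance_rate(prompts, client, draft_len):
    drafted = accepted = 0
    for p in prompts:
        stats = client.generate_with_stats(p, draft_tokens=draft_len)  # assumed API
        drafted += stats.drafted_tokens
        accepted += stats.accepted_tokens
    return accepted / max(drafted, 1)

def pick_draft_len(prompts, client, candidates=(4, 8, 16)):
    # Keep the largest draft length that still clears the 60% acceptance floor.
    best = None
    for k in candidates:
        rate = acceptance_rate(prompts, client, k)
        print(f"draft_len={k}: acceptance {rate:.0%}")
        if rate >= 0.6:
            best = k
    return best or min(candidates)
```

Run it on a sample of real production prompts rather than synthetic benchmarks, since acceptance rates track how predictable your actual traffic is.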
Why Now
Speculative decoding has been a research topic for years. What changed is scale and simplicity. Google's implementation is production-ready, documented, and tuned. You don't need a PhD in compiler optimization to use it. Drop it in, measure cost, ship.
The 45x computer use premium suddenly becomes surmountable. The "I need a bigger model" conversation becomes "Have you tried drafting?" For founders whose margins hinge on inference costs, the story shifts from cost-cutting to profitability.
This is the kind of efficiency win that compounds: better margins fund faster iteration, which funds better products, which fund expansion into price-sensitive markets. Gemma 4's drafting is the lever.
Now you know more than 99% of people. — Sara Plaintext