Gemma 4 Upgrade Guide: Multi-Token Prediction & Breaking Changes
Google's Gemma 4 release brings multi-token prediction drafters to the open-source model ecosystem—a significant leap in inference speed without quality loss. If you're running Gemma 3 or earlier, this guide walks you through the upgrade in 5 minutes, highlighting what breaks, what costs less, and when to hold off.
What Changed: Model IDs & New Architecture
Gemma 4 introduces a two-model inference system: the main model plus a smaller drafter model for speculative decoding. This means your model ID needs updating, and you'll manage two models instead of one.
Old Model ID: google/gemma-3-9b
New Model ID: google/gemma-4-9b
New Drafter Model: google/gemma-4-9b-drafter
The drafter is required for multi-token prediction performance gains. Running Gemma 4 without the drafter model negates most latency benefits.
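To build intuition for why the drafter matters, here is a toy sketch of the draft-then-verify loop behind multi-token prediction: the drafter cheaply proposes several tokens, and the main model accepts a prefix of them in a single verification pass. Both "models" below are dummy callables with made-up rules, not real Gemma weights — this illustrates the control flow only.

```python
def drafter(prefix, k):
    """Hypothetical drafter: cheaply proposes the next k tokens."""
    return [prefix[-1] + 1 + i for i in range(k)]  # dummy proposals

def main_model(prefix, proposals):
    """Hypothetical main model: verifies proposals in one pass,
    accepting the longest prefix it agrees with."""
    accepted = []
    for tok in proposals:
        if tok % 7 != 0:           # dummy acceptance rule
            accepted.append(tok)
        else:
            break
    if not accepted:               # rejection: main model emits one token itself
        accepted = [prefix[-1] + 1]
    return accepted

def generate(prompt, max_tokens=12, draft_tokens=5):
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        proposals = drafter(out, draft_tokens)
        out.extend(main_model(out, proposals))
    return out[len(prompt):][:max_tokens]
```

When most proposals are accepted, each verification pass yields several tokens for roughly the price of one main-model step — which is where the latency gain comes from, and why skipping the drafter negates it.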
Breaking Changes You'll Hit
1. Configuration Structure
Gemma 4 requires a new settings.json schema to enable multi-token prediction drafting. Your old config won't work.
Old settings.json:
```json
{
  "model_id": "google/gemma-3-9b",
  "max_tokens": 512,
  "temperature": 0.7
}
```
New settings.json:
```json
{
  "model_id": "google/gemma-4-9b",
  "drafter_id": "google/gemma-4-9b-drafter",
  "speculative_decoding": true,
  "draft_tokens": 5,
  "max_tokens": 512,
  "temperature": 0.7
}
```
The speculative_decoding flag is the gating parameter. Set to false to disable multi-token prediction (not recommended unless debugging).
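If you manage many configs, the migration is mechanical enough to script. This is a minimal sketch that follows the field names in the examples above; adapt it if your configs carry extra keys.

```python
import json

def migrate_settings(old):
    """Migrate a Gemma 3 settings dict to the Gemma 4 schema shown above."""
    new = dict(old)  # preserve unrelated keys like max_tokens, temperature
    new["model_id"] = old["model_id"].replace("gemma-3", "gemma-4")
    new["drafter_id"] = new["model_id"] + "-drafter"
    new["speculative_decoding"] = True
    new["draft_tokens"] = 5
    return new

old = {"model_id": "google/gemma-3-9b", "max_tokens": 512, "temperature": 0.7}
print(json.dumps(migrate_settings(old), indent=2))
```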
2. Token Counting & Billing
Multi-token prediction changes how tokens are counted. Draft tokens don't incur full cost—only verified tokens from the main model are charged. This is a cost reduction, but if you're tracking token usage for analytics, your numbers will drop unexpectedly.
Old behavior: All tokens = billable tokens
New behavior: Drafter tokens (speculative) = ~10-20% cost; verified tokens (main model) = full cost
Update your cost tracking dashboards to account for this.
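As a concrete sketch of the new billing split, the helper below applies the ~10-20% drafter rate from above (0.15 is an assumed midpoint; check your provider's actual rate):

```python
def effective_cost(verified_tokens, draft_tokens, price_per_token,
                   draft_cost_ratio=0.15):
    """Estimate spend under the split billing described above:
    verified tokens at full price, rejected draft tokens at a fraction."""
    return (verified_tokens * price_per_token
            + draft_tokens * price_per_token * draft_cost_ratio)

# Example: 1M verified + 300k rejected draft tokens at $2 per 1M tokens.
old_spend = (1_000_000 + 300_000) * 2e-6          # everything at full rate
new_spend = effective_cost(1_000_000, 300_000, 2e-6)
```

Here the old model of "all tokens billable" would report $2.60, while the split billing comes to about $2.09 — the kind of unexpected drop your dashboards need to account for.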
3. Latency Metrics Change
Time-to-first-token (TTFT) stays similar, but time-per-token (TpT) improves 2-4x. If you're monitoring SLAs based on old benchmarks, adjust thresholds downward. Responses will complete faster than expected.
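To re-derive your SLA thresholds, a simple end-to-end latency model is enough. The figures below are illustrative assumptions (200 ms TTFT, 40 ms TpT before, a conservative 2.5x TpT speedup), not benchmarks:

```python
def response_latency_ms(ttft_ms, tpt_ms, n_tokens):
    """End-to-end latency model: time-to-first-token plus per-token time."""
    return ttft_ms + tpt_ms * n_tokens

# TTFT stays similar; TpT drops from an assumed 40 ms to 16 ms (2.5x).
before = response_latency_ms(200, 40, 256)   # 10440 ms
after  = response_latency_ms(200, 16, 256)   #  4296 ms
```

Note the end-to-end improvement (~2.4x here) is slightly below the raw TpT speedup because TTFT is unchanged — worth remembering when you set the new thresholds.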
Settings.json: Complete Config Edits
- Update model IDs: Replace google/gemma-3-* with google/gemma-4-*
- Add drafter: Insert "drafter_id": "google/gemma-4-9b-drafter"
- Enable speculative decoding: Set "speculative_decoding": true
- Tune draft tokens: Start with "draft_tokens": 5; increase to 10 for longer generations (higher latency, better throughput)
- Test locally first: Run inference with both old and new configs to validate output quality
Pro tip: Both models need to fit in VRAM. A 9B + 2B drafter setup requires ~14GB. If memory is tight, the drafter is still smaller than the main model—use quantization (4-bit) to fit.
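A back-of-the-envelope weight-memory estimate helps check whether both models fit. This sketch counts weights only and ignores KV cache and activations (the ~14GB figure above presumably includes that runtime overhead, which would be consistent with roughly 8-bit weights):

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Rough weight-only memory estimate; excludes KV cache and activations."""
    return params_billion * bits_per_param / 8

# 9B main model + ~2B drafter (drafter size assumed for illustration):
total_8bit = weight_memory_gb(9, 8) + weight_memory_gb(2, 8)   # 11.0 GB weights
total_4bit_drafter = weight_memory_gb(9, 8) + weight_memory_gb(2, 4)  # 10.0 GB
```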
Cost Impact: The Good News
Gemma 4's multi-token prediction reduces effective inference cost by 30-40% depending on workload. For startups running high-volume inference:
- Latency: 2-4x faster per token
- Compute: 30-40% fewer GPU cycles
- Throughput: More requests served per second on same hardware
- Token cost: Fewer total tokens needed per request
If you're on cloud inference (Hugging Face, Lambda, etc.), check if they've updated pricing for Gemma 4. Some providers haven't yet—you might see no cost difference until they optimize their serving stack.
Gotchas & Edge Cases
- Drafter model not found: Ensure both models are downloaded. If using a local cache, pull google/gemma-4-9b-drafter explicitly.
- Determinism broken: Speculative decoding introduces mild non-determinism (the same seed doesn't guarantee the same output). If you need reproducibility, disable speculative decoding or seed the drafter separately.
- Batching complexity: Multi-token prediction plays differently with batched inference. Test batch sizes before production.
- Fine-tuned models: If you've fine-tuned Gemma 3, you cannot use Gemma 4 checkpoints directly. Retrain or use an adapter.
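For the determinism gotcha above, the fix is a config change rather than code. A minimal settings.json for reproducible output might look like the fragment below, following this guide's schema (whether drafter_id can simply be omitted when drafting is disabled is an assumption — keep the key if your runtime requires it):

```json
{
  "model_id": "google/gemma-4-9b",
  "speculative_decoding": false,
  "max_tokens": 512,
  "temperature": 0.7
}
```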
When NOT to Upgrade
Hold off on Gemma 4 if:
- You have fine-tuned Gemma 3 models in production and can't retrain
- You need absolute output determinism (same seed = same output)
- Your VRAM is under 12GB and quantization isn't an option
- You're on an old inference framework (TensorFlow) that doesn't support speculative decoding yet
- You're batch-processing and see performance regressions during testing
Otherwise, upgrade. The speed gain and cost reduction are substantial for most workloads.
Quick Migration Checklist
- Back up current settings.json
- Update model IDs to google/gemma-4-9b and google/gemma-4-9b-drafter
- Add speculative decoding config keys
- Pull both models locally or confirm cloud provider supports them
- Test inference on representative prompts
- Update cost tracking and latency dashboards
- Monitor first 24 hours for regressions
Gemma 4's multi-token prediction is production-ready. The upgrade pays for itself through faster inference and lower compute costs. Start testing today.
— Sara Plaintext
