Gemma 4 Upgrade Guide: Multi-Token Prediction & Breaking Changes
Google's Gemma 4 release brings multi-token prediction drafters to the open-source model ecosystem—a significant leap in inference speed without quality loss. If you're running Gemma 3 or earlier, this guide walks you through the upgrade in 5 minutes, highlighting what breaks, what costs less, and when to hold off.
What Changed: Model IDs & New Architecture
Gemma 4 introduces a two-model inference system: the main model plus a smaller drafter model for speculative decoding. This means your model ID needs updating, and you'll manage two models instead of one.
Old Model ID: google/gemma-3-9b
New Model ID: google/gemma-4-9b
New Drafter Model: google/gemma-4-9b-drafter
The drafter is required for multi-token prediction performance gains. Running Gemma 4 without the drafter model negates most latency benefits.
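To build intuition for why the drafter matters, here is a toy sketch of the draft-then-verify loop behind multi-token prediction: the drafter cheaply proposes several tokens, and the main model accepts a prefix of them in a single verification pass. Both "models" below are dummy callables with made-up rules, not real Gemma weights — this illustrates the control flow only.

```python
def drafter(prefix, k):
    """Hypothetical drafter: cheaply proposes the next k tokens."""
    return [prefix[-1] + 1 + i for i in range(k)]  # dummy proposals

def main_model(prefix, proposals):
    """Hypothetical main model: verifies proposals in one pass,
    accepting the longest prefix it agrees with."""
    accepted = []
    for tok in proposals:
        if tok % 7 != 0:           # dummy acceptance rule
            accepted.append(tok)
        else:
            break
    if not accepted:               # rejection: main model emits one token itself
        accepted = [prefix[-1] + 1]
    return accepted

def generate(prompt, max_tokens=12, draft_tokens=5):
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        proposals = drafter(out, draft_tokens)
        out.extend(main_model(out, proposals))
    return out[len(prompt):][:max_tokens]
```

When most proposals are accepted, each verification pass yields several tokens for roughly the price of one main-model step — which is where the latency gain comes from, and why skipping the drafter negates it.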
Breaking Changes You'll Hit
1. Configuration Structure
Gemma 4 requires a new settings.json schema to enable multi-token prediction drafting. Your old config won't work.
Old settings.json:
```json
{
  "model_id": "google/gemma-3-9b",
  "max_tokens": 512,
  "temperature": 0.7
}
```
New settings.json:
```json
{
  "model_id": "google/gemma-4-9b",
  "drafter_id": "google/gemma-4-9b-drafter",
  "speculative_decoding": true,
  "draft_tokens": 5,
  "max_tokens": 512,
  "temperature": 0.7
}
```
The speculative_decoding flag is the gating parameter. Set to false to disable multi-token prediction (not recommended unless debugging).
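If you manage many configs, the migration is mechanical enough to script. This is a minimal sketch that follows the field names in the examples above; adapt it if your configs carry extra keys.

```python
import json

def migrate_settings(old):
    """Migrate a Gemma 3 settings dict to the Gemma 4 schema shown above."""
    new = dict(old)  # preserve unrelated keys like max_tokens, temperature
    new["model_id"] = old["model_id"].replace("gemma-3", "gemma-4")
    new["drafter_id"] = new["model_id"] + "-drafter"
    new["speculative_decoding"] = True
    new["draft_tokens"] = 5
    return new

old = {"model_id": "google/gemma-3-9b", "max_tokens": 512, "temperature": 0.7}
print(json.dumps(migrate_settings(old), indent=2))
```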
2. Token Counting & Billing
Multi-token prediction changes how tokens are counted. Draft tokens don't incur full cost—only verified tokens from the main model are charged. This is a cost reduction, but if you're tracking token usage for analytics, your numbers will drop unexpectedly.
Old behavior: All tokens = billable tokens
New behavior: Drafter tokens (speculative) = ~10-20% cost; verified tokens (main model) = full cost
Update your cost tracking dashboards to account for this.
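As a concrete sketch of the new billing split, the helper below applies the ~10-20% drafter rate from above (0.15 is an assumed midpoint; check your provider's actual rate):

```python
def effective_cost(verified_tokens, draft_tokens, price_per_token,
                   draft_cost_ratio=0.15):
    """Estimate spend under the split billing described above:
    verified tokens at full price, rejected draft tokens at a fraction."""
    return (verified_tokens * price_per_token
            + draft_tokens * price_per_token * draft_cost_ratio)

# Example: 1M verified + 300k rejected draft tokens at $2 per 1M tokens.
old_spend = (1_000_000 + 300_000) * 2e-6          # everything at full rate
new_spend = effective_cost(1_000_000, 300_000, 2e-6)
```

Here the old model of "all tokens billable" would report $2.60, while the split billing comes to about $2.09 — the kind of unexpected drop your dashboards need to account for.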
3. Latency Metrics Change
Time-to-first-token (TTFT) stays similar, but time-per-token (TpT) improves 2-4x. If you're monitoring SLAs based on old benchmarks, adjust thresholds downward. Responses will complete faster than expected.
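To re-derive your SLA thresholds, a simple end-to-end latency model is enough. The figures below are illustrative assumptions (200 ms TTFT, 40 ms TpT before, a conservative 2.5x TpT speedup), not benchmarks:

```python
def response_latency_ms(ttft_ms, tpt_ms, n_tokens):
    """End-to-end latency model: time-to-first-token plus per-token time."""
    return ttft_ms + tpt_ms * n_tokens

# TTFT stays similar; TpT drops from an assumed 40 ms to 16 ms (2.5x).
before = response_latency_ms(200, 40, 256)   # 10440 ms
after  = response_latency_ms(200, 16, 256)   #  4296 ms
```

Note the end-to-end improvement (~2.4x here) is slightly below the raw TpT speedup because TTFT is unchanged — worth remembering when you set the new thresholds.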
Settings.json: Complete Config Edits
- Update model IDs: Replace google/gemma-3-* with google/gemma-4-*
- Add drafter: Insert "drafter_id": "google/gemma-4-9b-drafter"
- Enable speculative decoding: Set "speculative_decoding": true
- Tune draft tokens: Start with "draft_tokens": 5; increase to 10 for longer generations (higher latency, better throughput)
- Test locally first: Run inference with both old and new configs to validate output quality
Pro tip: Both models need to fit in VRAM. A 9B + 2B drafter setup requires ~14GB. If memory is tight, the drafter is still smaller than the main model—use quantization (4-bit) to fit.
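A back-of-the-envelope weight-memory estimate helps check whether both models fit. This sketch counts weights only and ignores KV cache and activations (the ~14GB figure above presumably includes that runtime overhead, which would be consistent with roughly 8-bit weights):

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Rough weight-only memory estimate; excludes KV cache and activations."""
    return params_billion * bits_per_param / 8

# 9B main model + ~2B drafter (drafter size assumed for illustration):
total_8bit = weight_memory_gb(9, 8) + weight_memory_gb(2, 8)   # 11.0 GB weights
total_4bit_drafter = weight_memory_gb(9, 8) + weight_memory_gb(2, 4)  # 10.0 GB
```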
Cost Impact: The Good News
Gemma 4's multi-token prediction reduces effective inference cost by 30-40% depending on workload. For startups running high-volume inference:
- Latency: 2-4x faster per token
- Compute: 30-40% fewer GPU cycles
- Throughput: More requests served per second on same hardware
- Token cost: Fewer total tokens needed per request
If you're on cloud inference (Hugging Face, Lambda, etc.), check if they've updated pricing for Gemma 4. Some providers haven't yet—you might see no cost difference until they optimize their serving stack.
Gotchas & Edge Cases
- Drafter model not found: Ensure both models are downloaded. If using a local cache, pull google/gemma-4-9b-drafter explicitly.
- Determinism broken: Speculative decoding introduces mild non-determinism (the same seed doesn't guarantee the same output). If you need reproducibility, disable speculative decoding or seed the drafter separately.
- Batching complexity: Multi-token prediction plays differently with batched inference. Test batch sizes before production.
- Fine-tuned models: If you've fine-tuned Gemma 3, you cannot use Gemma 4 checkpoints directly. Retrain or use an adapter.
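For the determinism gotcha above, the fix is a config change rather than code. A minimal settings.json for reproducible output might look like the fragment below, following this guide's schema (whether drafter_id can simply be omitted when drafting is disabled is an assumption — keep the key if your runtime requires it):

```json
{
  "model_id": "google/gemma-4-9b",
  "speculative_decoding": false,
  "max_tokens": 512,
  "temperature": 0.7
}
```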
When NOT to Upgrade
Hold off on Gemma 4 if:
- You have fine-tuned Gemma 3 models in production and can't retrain
- You need absolute output determinism (same seed = same output)
- Your VRAM is under 12GB and quantization isn't an option
- You're on an old inference framework (TensorFlow) that doesn't support speculative decoding yet
- You're batch-processing and see performance regressions during testing
Otherwise, upgrade. The speed gain and cost reduction are substantial for most workloads.
Quick Migration Checklist
- Back up current settings.json
- Update model IDs to google/gemma-4-9b and google/gemma-4-9b-drafter
- Add speculative decoding config keys
- Pull both models locally or confirm cloud provider supports them
- Test inference on representative prompts
- Update cost tracking and latency dashboards
- Monitor first 24 hours for regressions
Gemma 4's multi-token prediction is production-ready. The upgrade pays for itself through faster inference and lower compute costs. Start testing today.
— Sara Plaintext
