Gemma 4 Upgrade Guide: Multi-Token Prediction & Breaking Changes

Google's Gemma 4 release brings multi-token prediction drafters to the open-source model ecosystem—a significant leap in inference speed without quality loss. If you're running Gemma 3 or earlier, this guide walks you through the upgrade in 5 minutes, highlighting what breaks, what costs less, and when to hold off.

What Changed: Model IDs & New Architecture

Gemma 4 introduces a two-model inference system: the main model plus a smaller drafter model for speculative decoding. This means your model ID needs updating, and you'll manage two models instead of one.

Old Model ID: google/gemma-3-9b

New Model ID: google/gemma-4-9b

New Drafter Model: google/gemma-4-9b-drafter

The drafter is required for multi-token prediction performance gains. Running Gemma 4 without the drafter model negates most latency benefits.
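To make the two-model flow concrete, here's a toy sketch of the draft-then-verify loop behind multi-token prediction. The `draft_model` and `main_model` functions are stand-ins invented for illustration, not real Gemma 4 APIs:

```python
def draft_model(prefix, k):
    """Hypothetical drafter: cheaply proposes k candidate tokens."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def main_model(prefix, candidates):
    """Hypothetical verifier: keeps the longest prefix of candidates it agrees with.

    A real main model scores all candidates in a single forward pass; here we
    reject the last candidate to illustrate partial acceptance."""
    return candidates[:-1] if len(candidates) > 1 else candidates

def speculative_decode(prompt, max_tokens=10, draft_tokens=5):
    output = list(prompt)
    while len(output) - len(prompt) < max_tokens:
        candidates = draft_model(output, draft_tokens)  # cheap draft pass
        accepted = main_model(output, candidates)       # one verification pass
        output.extend(accepted)                         # several tokens per step
    return output[len(prompt):]

print(speculative_decode(["<bos>"], max_tokens=8))
```

Each loop iteration emits several tokens for the price of one main-model pass, which is where the per-token speedup comes from.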

Breaking Changes You'll Hit

1. Configuration Structure

Gemma 4 requires a new settings.json schema to enable multi-token prediction drafting. Your old config won't work.

Old settings.json:

{
  "model_id": "google/gemma-3-9b",
  "max_tokens": 512,
  "temperature": 0.7
}

New settings.json:

{
  "model_id": "google/gemma-4-9b",
  "drafter_id": "google/gemma-4-9b-drafter",
  "speculative_decoding": true,
  "draft_tokens": 5,
  "max_tokens": 512,
  "temperature": 0.7
}

The speculative_decoding flag is the gating parameter. Set it to false to disable multi-token prediction (not recommended outside of debugging).
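A quick way to catch a stale config before it reaches production is to fail fast on missing keys. Here's a minimal startup sanity check assuming the schema shown above; `validate_settings` is a hypothetical helper, not part of any Gemma tooling:

```python
import json

# Keys introduced by the Gemma 4 schema shown above
REQUIRED_GEMMA4_KEYS = {"model_id", "drafter_id", "speculative_decoding", "draft_tokens"}

def validate_settings(path="settings.json"):
    """Raise early if settings.json is still on the old Gemma 3 schema."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_GEMMA4_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"settings.json is missing Gemma 4 keys: {sorted(missing)}")
    if cfg["speculative_decoding"] and not cfg["drafter_id"].endswith("-drafter"):
        raise ValueError("speculative_decoding is on but drafter_id looks wrong")
    return cfg
```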

2. Token Counting & Billing

Multi-token prediction changes how tokens are counted and billed. Draft tokens are billed at a steep discount; only tokens verified by the main model are charged at the full rate. This is a net cost reduction, but if you track token usage for analytics, your reported numbers will drop unexpectedly.

Old behavior: All tokens = billable tokens

New behavior: Drafter tokens (speculative) = ~10-20% cost; verified tokens (main model) = full cost

Update your cost tracking dashboards to account for this.
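As a starting point for that dashboard update, here's a sketch of the split billing model, using the midpoint of the ~10-20% range as the default discount. The exact rate is provider-specific, so treat `draft_discount` as a tunable assumption:

```python
def effective_cost(verified_tokens, draft_tokens, price_per_token,
                   draft_discount=0.15):
    """Estimate spend when draft tokens bill at a fraction of the full rate.

    draft_discount=0.15 is the midpoint of the ~10-20% range; check your
    provider's actual pricing."""
    full_rate = verified_tokens * price_per_token
    discounted = draft_tokens * price_per_token * draft_discount
    return full_rate + discounted

# Illustrative numbers: 1M verified tokens, 300k draft tokens, $0.20 per 1M tokens
price = 0.20 / 1_000_000
print(f"${effective_cost(1_000_000, 300_000, price):.4f}")
```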

3. Latency Metrics Change

Time-to-first-token (TTFT) stays similar, but time-per-token (TpT) improves 2-4x. If you're monitoring SLAs based on old benchmarks, adjust thresholds downward. Responses will complete faster than expected.
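When retuning those SLA thresholds, a rough completion-time model helps: TTFT stays flat while per-token time divides by the speedup. All constants below are illustrative placeholders, not measured Gemma 4 benchmarks:

```python
def expected_latency_s(n_tokens, ttft_s=0.4, tpt_s=0.03, speedup=3.0):
    """Completion time if TTFT is unchanged and TpT improves by `speedup`.

    Defaults are made-up round numbers; substitute your own measurements."""
    return ttft_s + n_tokens * (tpt_s / speedup)

# A 512-token response under the old vs. new per-token rate
old = 0.4 + 512 * 0.03
new = expected_latency_s(512)
print(f"old ~{old:.1f}s, new ~{new:.1f}s")
```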

settings.json: Complete Config Edits

  1. Update model IDs: Replace google/gemma-3-* with google/gemma-4-*
  2. Add drafter: Insert "drafter_id": "google/gemma-4-9b-drafter"
  3. Enable speculative decoding: Set "speculative_decoding": true
  4. Tune draft tokens: Start with "draft_tokens": 5; increase to 10 for longer generations (higher latency, better throughput)
  5. Test locally first: Run inference with both old and new configs to validate output quality
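The first three edits above are mechanical enough to script. Here's a minimal sketch, assuming the `<model_id>-drafter` naming pattern used throughout this guide holds for your model size; verify the generated IDs before deploying:

```python
import json

def migrate_settings(old_cfg):
    """Rewrite an old-style config dict to the Gemma 4 schema (steps 1-3 above)."""
    cfg = dict(old_cfg)                               # keep max_tokens, temperature, etc.
    cfg["model_id"] = old_cfg["model_id"].replace("gemma-3", "gemma-4")
    cfg["drafter_id"] = cfg["model_id"] + "-drafter"  # assumed naming pattern
    cfg["speculative_decoding"] = True
    cfg["draft_tokens"] = 5                           # step 4: tune from here
    return cfg

old = {"model_id": "google/gemma-3-9b", "max_tokens": 512, "temperature": 0.7}
print(json.dumps(migrate_settings(old), indent=2))
```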

Pro tip: Both models need to fit in VRAM; a 9B main model plus a 2B drafter requires ~14GB. If memory is tight, apply 4-bit quantization to the main model; the drafter is already small, so quantizing it buys you little.

Cost Impact: The Good News

Gemma 4's multi-token prediction reduces effective inference cost by 30-40% depending on workload, which adds up quickly for startups running high-volume inference.

If you're on cloud inference (Hugging Face, Lambda, etc.), check if they've updated pricing for Gemma 4. Some providers haven't yet—you might see no cost difference until they optimize their serving stack.

Gotchas & Edge Cases

When NOT to Upgrade

Hold off on Gemma 4 if your hardware can't fit both models in VRAM (even quantized), or if your inference provider doesn't yet serve the drafter model.

Otherwise, upgrade. The speed gain and cost reduction are substantial for most workloads.

Quick Migration Checklist

  1. Back up current settings.json
  2. Update model IDs to google/gemma-4-9b and google/gemma-4-9b-drafter
  3. Add speculative decoding config keys
  4. Pull both models locally or confirm cloud provider supports them
  5. Test inference on representative prompts
  6. Update cost tracking and latency dashboards
  7. Monitor first 24 hours for regressions

Gemma 4's multi-token prediction is production-ready. The upgrade pays for itself through faster inference and lower compute costs. Start testing today.

Now you know more than 99% of people. — Sara Plaintext