Gemma 4 Setup Guide: Multi-Token Prediction Across Dev Tools
Faster inference, lower costs, same quality. Deploy Google's latest open-weight model with speculative decoding.
Claude Code (Anthropic's Agentic Coding Tool)
Claude Code runs Claude directly in your terminal and integrates with popular editors. To use Gemma 4 as a reference or companion model, set local environment variables that point to a Gemma 4 endpoint.
GEMMA_4_ENDPOINT=https://your-inference-server.com/v1
GEMMA_4_API_KEY=your_api_key_here
GEMMA_4_MODEL=google/gemma-4-9b-it
ENABLE_MULTI_TOKEN_PREDICTION=true
DRAFTER_MODEL=google/gemma-4-drafter-3b
In your Claude Code workspace settings, add Gemma 4 as a secondary model for code suggestions and completions. This lets you compare Claude's output against Gemma 4's faster inference path. Multi-token prediction is particularly useful for generating boilerplate and repetitive code patterns.
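To sanity-check the pairing, you can send the same prompt to both models and compare the replies. The sketch below is a minimal example, assuming the environment variables above, an OpenAI-compatible /v1/chat/completions route on your Gemma 4 server, and the official anthropic Python SDK; the Claude model ID is a placeholder to swap for the one you actually use.

import os

import anthropic
import requests

prompt = "Refactor this nested loop into a list comprehension: ..."

# Gemma 4 via the OpenAI-compatible endpoint configured above.
gemma_reply = requests.post(
    f"{os.environ['GEMMA_4_ENDPOINT']}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GEMMA_4_API_KEY']}"},
    json={
        "model": os.environ["GEMMA_4_MODEL"],
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    },
).json()["choices"][0]["message"]["content"]

# Claude via the Anthropic SDK (reads ANTHROPIC_API_KEY from the environment).
claude_reply = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use your preferred Claude model ID
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
).content[0].text

print("--- Gemma 4 ---\n" + gemma_reply)
print("--- Claude ---\n" + claude_reply)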
Cursor (AI-First Code Editor)
Cursor supports custom model endpoints through its settings panel, and Gemma 4 integrates via an OpenAI-compatible API. Configure it with multi-token prediction enabled for faster completions.
[models.gemma4]
provider = "openai-compatible"
base_url = "https://your-inference-endpoint.com/v1"
api_key = "your_gemma_4_api_key"
model = "google/gemma-4-9b-instruct"
enable_speculative_decoding = true
drafter_model = "google/gemma-4-drafter-3b"
context_window = 8192
max_tokens = 4096
In Cursor's command palette, select "Models" and add Gemma 4 as your primary or fallback model. The multi-token prediction feature drafts several tokens per step and verifies them in parallel, reducing latency for code completions. It is particularly effective for longer function generations, where speculative decoding shines.
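To check the latency claim on your own workload, a rough benchmark like the sketch below times the same completion with and without the drafter. It assumes the endpoint configured above speaks the OpenAI chat-completions format and accepts the hypothetical speculative_decoding request field used throughout this guide.

import time

import requests

ENDPOINT = "https://your-inference-endpoint.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer your_gemma_4_api_key"}

def timed_completion(extra_fields=None):
    """Send one completion request and return (elapsed seconds, completion text)."""
    body = {
        "model": "google/gemma-4-9b-it",
        "messages": [{"role": "user", "content": "Generate a REST client class in Python."}],
        "max_tokens": 512,
    }
    if extra_fields:
        body.update(extra_fields)
    start = time.perf_counter()
    response = requests.post(ENDPOINT, headers=HEADERS, json=body)
    response.raise_for_status()
    text = response.json()["choices"][0]["message"]["content"]
    return time.perf_counter() - start, text

baseline_s, _ = timed_completion()
speculative_s, _ = timed_completion({
    "speculative_decoding": {"drafter_model": "google/gemma-4-drafter-3b", "num_tokens": 4}
})
print(f"baseline: {baseline_s:.2f}s  speculative: {speculative_s:.2f}s")

Run it a few times and average the results; single requests are noisy.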
Zed (High-Performance Editor)
Zed's built-in assistant supports custom, OpenAI-compatible model backends. Deploy Gemma 4 via a local or remote inference server, then point Zed's AI configuration (settings.json) to it.
{
  "assistant": {
    "default_model": {
      "provider": "openai",
      "model": "gemma-4-9b-it"
    },
    "openai": {
      "api_url": "https://your-gemma-inference.com/v1",
      "api_key": "your_key",
      "low_speed_timeout_in_seconds": 30
    },
    "inline_alternative_models": [
      {
        "provider": "openai",
        "model": "gemma-4-drafter-3b"
      }
    ]
  }
}
Zed's lightweight architecture pairs perfectly with Gemma 4's efficient multi-token prediction. The drafter model runs locally on most machines, speeding up inline completions. Configure your inference server to enable speculative decoding for best performance.
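As a concrete starting point, the sketch below shows one way to enable speculative decoding when you host the model yourself with vLLM's offline API. The Gemma 4 checkpoint names are the placeholders used throughout this guide, and the exact argument shape varies by vLLM release (recent versions take a speculative_config dict, older ones separate speculative_model / num_speculative_tokens arguments), so treat this as a template and check your version's documentation. The same settings carry over to vLLM's OpenAI-compatible server.

from vllm import LLM, SamplingParams

# Target model plus drafter; both checkpoint names are placeholders from this guide.
llm = LLM(
    model="google/gemma-4-9b-it",
    speculative_config={
        "model": "google/gemma-4-drafter-3b",
        "num_speculative_tokens": 4,  # draft tokens proposed and verified per step
    },
)

outputs = llm.generate(
    ["Write a function that parses a TOML config file."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)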
Anthropic API (Direct Integration)
While Anthropic's API serves Claude models only, you can orchestrate Gemma 4 alongside it: route cost-sensitive tasks to a Gemma 4 endpoint and reserve Claude for complex reasoning.
import os

import anthropic
import requests

# Anthropic client for Claude, used below for complex reasoning tasks.
client = anthropic.Anthropic()

def call_gemma_4_fast(prompt):
    """Send a prompt to a Gemma 4 endpoint with speculative decoding enabled."""
    response = requests.post(
        "https://your-inference-endpoint/v1/messages",
        headers={"Authorization": f"Bearer {os.environ['GEMMA_4_API_KEY']}"},
        json={
            "model": "google/gemma-4-9b-it",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
            "speculative_decoding": {
                "drafter_model": "google/gemma-4-drafter-3b",
                "num_tokens": 4,
            },
        },
    )
    response.raise_for_status()
    return response.json()["content"][0]["text"]

result = call_gemma_4_fast("Write a Python function for...")
This hybrid approach lets you route simple queries to Gemma 4 (fast, cheap) and complex tasks to Claude. Multi-token prediction on Gemma 4 means your inference costs drop while latency remains minimal.
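A minimal routing layer on top of call_gemma_4_fast might look like the sketch below. The Claude model ID and the complexity heuristic are illustrative placeholders, not a recommendation; substitute whatever criteria and model fit your workload.

def call_claude(prompt):
    """Reserve Claude for complex reasoning tasks."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use your preferred Claude model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

def route(prompt):
    """Illustrative heuristic: short or boilerplate-style prompts go to Gemma 4,
    everything else goes to Claude."""
    simple_markers = ("boilerplate", "docstring", "rename", "unit test")
    if len(prompt) < 300 or any(m in prompt.lower() for m in simple_markers):
        return call_gemma_4_fast(prompt)
    return call_claude(prompt)

print(route("Write a docstring for this helper function: ..."))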
AWS Bedrock (Managed Service)
Bedrock provides managed inference for Gemma models. Request access to Gemma 4, then configure it with on-demand throughput to leverage multi-token prediction.
import json

import boto3

# Bedrock runtime client in a region where Gemma 4 is available.
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock.invoke_model(
    modelId="google.gemma-4-9b-instruct",
    body=json.dumps({
        "prompt": "Your prompt here",
        "max_tokens": 2048,
        "temperature": 0.7,
        "speculative_decoding_config": {
            "enabled": True,
            "drafter_model": "google.gemma-4-drafter-3b",
            "num_draft_tokens": 4
        }
    })
)

# The response body is a streaming object; read it once and parse the JSON payload.
result = json.loads(response["body"].read())
print(result["outputs"][0]["text"])
Bedrock handles scaling and multi-token prediction automatically, so there is no infrastructure to manage. For startups, this eliminates DevOps overhead while still delivering Gemma 4's speed benefits, and pricing scales with usage, which suits variable workloads.
Google Vertex AI (GCP Native)
Vertex AI offers Gemma 4 with native multi-token prediction support through Model Garden. Deploy it via the Prediction API for production workloads.
from google.cloud import aiplatform

# Initialize the Vertex AI SDK for your project and region.
aiplatform.init(project="your-project", location="us-central1")

# Reference the endpoint backing your Gemma 4 Model Garden deployment.
endpoint = aiplatform.Endpoint(
    endpoint_name="projects/YOUR_PROJECT/locations/us-central1/endpoints/gemma-4-multi-token"
)

response = endpoint.predict(
    instances=[{
        "prompt": "Your prompt here",
        "parameters": {
            "max_tokens": 2048,
            "temperature": 0.7,
            "speculative_decoding": True,
            "num_draft_tokens": 4
        }
    }]
)

print(response.predictions[0])
Vertex AI's managed infrastructure optimizes multi-token prediction across its TPU and GPU fleet, and deployments scale automatically with demand. For teams already on GCP, this is the fastest path to production Gemma 4 inference, with built-in observability and monitoring.
Next Steps: Choose your platform, enable speculative decoding, and benchmark the latency improvement on your own workload. Gemma 4's open-weight license carries no licensing fees, so you can deploy to production within days, not weeks.
Now you know more than 99% of people. — Sara Plaintext
