Granite 4.1 Setup Guide: Running IBM's 8B Efficiency Champion
IBM's Granite 4.1 is the efficiency story of the year: an 8B-parameter open-source model that matches 32B mixture-of-experts performance across multiple benchmarks. For builders, this means roughly 75% lower inference costs, sub-100ms latency on consumer hardware, and full data privacy.
This guide covers exact configuration for every major development platform. Pick your tool, copy the snippet, deploy in minutes.
Claude Code (Anthropic)
Claude Code integrates with Claude 3.5 Sonnet for development tasks. Use Granite 4.1 as a local fallback for non-critical inference during development.
{
  "models": [
    {
      "name": "claude-3-5-sonnet-20241022",
      "provider": "anthropic",
      "max_tokens": 4096,
      "temperature": 0.7
    },
    {
      "name": "granite-4.1-8b",
      "provider": "ollama",
      "base_url": "http://localhost:11434",
      "fallback": true,
      "context_window": 4096
    }
  ],
  "code_execution": {
    "timeout": 30,
    "memory_limit": "4gb"
  }
}
Install Ollama locally, run ollama pull granite:4.1, then point base_url at your Ollama instance's address. Claude Code will route non-critical tasks to Granite 4.1.
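Before wiring the local model into your editor, it is worth sanity-checking the endpoint directly. Here is a minimal standard-library-only sketch of a one-shot call to Ollama's /api/generate endpoint; the granite:4.1 tag comes from this guide, and the helper names are illustrative, not part of any official SDK. Note that generation parameters such as num_ctx belong under the "options" key in the Ollama API, not at the top level of the request body.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local Ollama address

def build_generate_payload(prompt, model="granite:4.1", num_ctx=4096):
    # Ollama's /api/generate expects sampling and context parameters
    # nested under "options", not at the top level of the body.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

def granite_generate(prompt, timeout=30):
    # One-shot, non-streaming completion against the local model.
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

If the call succeeds, the model is pulled and serving, and any of the editor configs below should work against the same base URL.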
Cursor (IDE)
Cursor's native model switching lets you toggle between cloud and local inference. Configure Granite 4.1 as your primary coding assistant to reduce latency and API costs during development.
{
  "models": [
    {
      "id": "granite-4.1-8b",
      "name": "Granite 4.1 (Local)",
      "provider": "ollama",
      "endpoint": "http://localhost:11434/api/generate",
      "model": "granite:4.1",
      "context_length": 4096,
      "default": true
    }
  ],
  "inference": {
    "local_only": false,
    "fallback_to_cloud": true,
    "timeout_ms": 5000
  }
}
In Cursor settings, navigate to "Models" and paste this config. Granite 4.1 will handle autocomplete and inline suggestions locally, falling back to the cloud if latency exceeds 5 seconds.
Zed (Zed.dev)
Zed's LSP-based architecture supports local models via custom language servers. Granite 4.1 runs as an embedded completion provider with zero cloud dependency.
{
  "lsp": {
    "granite": {
      "executable": "ollama",
      "arguments": ["serve"],
      "initialization_options": {
        "model": "granite:4.1",
        "base_url": "http://localhost:11434"
      }
    }
  },
  "language_servers": {
    "python": ["granite"],
    "javascript": ["granite"],
    "rust": ["granite"]
  },
  "completion": {
    "provider": "granite-4.1-8b",
    "debounce_delay_ms": 100,
    "max_suggestions": 5
  }
}
Add this to your ~/.config/zed/settings.json. Zed will spawn an Ollama process for completions. Granite 4.1's 8B footprint runs comfortably on 12GB RAM machines.
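The 12GB claim is easy to sanity-check with back-of-envelope arithmetic. Assuming 4-bit quantized weights and roughly 20% overhead for KV cache and runtime buffers (both assumptions, not measured figures):

```python
def model_memory_gb(params_billions, bits_per_weight=4, overhead=0.20):
    # Weights: params * (bits / 8) bytes each, expressed in GB.
    # Overhead is a rough allowance for KV cache and runtime buffers.
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb * (1 + overhead)

print(round(model_memory_gb(8), 1))  # 8B model at 4-bit -> 4.8
```

About 5GB for the model leaves ample headroom on a 12GB machine; a full fp16 copy (16 bits per weight) would need closer to 19GB and would not fit.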
Anthropic API
Use Anthropic's API for production inference while leveraging Granite 4.1 for cost-sensitive workloads. Route based on latency tolerance and budget constraints.
import anthropic
import requests

client = anthropic.Anthropic(api_key="your-api-key")

def route_inference(prompt, cost_sensitive=False):
    if cost_sensitive:
        # Local Granite 4.1 via Ollama; sampling parameters go under "options"
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "granite:4.1",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.7, "num_ctx": 4096},
            },
        )
        return response.json()["response"]
    else:
        # Anthropic Claude for complex reasoning
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

# Example: Customer support (cost-sensitive)
support_response = route_inference(
    "Respond to this support ticket...",
    cost_sensitive=True,
)
Install: pip install anthropic requests. This pattern routes simple tasks to Granite 4.1 (~0.1¢ per request) and complex reasoning to Claude (~1¢). At scale, this can cut your inference bill by 70% or more.
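The 70%+ figure follows from simple arithmetic. Assuming (hypothetically) that 80% of traffic is cost-sensitive and using the per-request costs above:

```python
def blended_savings(cheap_share, cheap_cost, expensive_cost):
    # Average cost per request for the routed blend, compared to
    # sending every request to the expensive model.
    blended = cheap_share * cheap_cost + (1 - cheap_share) * expensive_cost
    return 1 - blended / expensive_cost

# 80% of requests at 0.1 cents (local Granite), 20% at 1 cent (Claude)
print(f"{blended_savings(0.8, 0.1, 1.0):.0%}")  # prints 72%
```

The savings scale directly with how much of your traffic you can classify as cost-sensitive, so the routing heuristic matters as much as the model choice.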
AWS Bedrock
Bedrock doesn't host Granite 4.1 directly yet. Deploy it via SageMaker for managed inference, or use Bedrock's model import feature for custom models.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# For now: invoke Claude via Bedrock, use local Granite for fallback
def invoke_bedrock(prompt):
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())

# Granite 4.1 alternative: Local Ollama
def invoke_granite_local(prompt):
    import requests
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "granite:4.1", "prompt": prompt, "stream": False},
    )
    return response.json()

# Route: Use Bedrock for critical paths, Granite local for background jobs
prompt = "Summarize today's error logs..."
result = invoke_bedrock(prompt)                    # Production
background_result = invoke_granite_local(prompt)   # Cost-optimized
Install: pip install boto3. Bedrock pricing starts at $0.50 per 1M input tokens; Granite 4.1 run locally is free after the initial hardware cost. At 100M daily tokens, Bedrock costs ~$50/day, while local Granite costs only electricity.
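The ~$50/day number is straightforward to reproduce from the quoted rate (input tokens only; output tokens are billed separately and would add to this):

```python
def bedrock_daily_cost(tokens_per_day, usd_per_million=0.50):
    # Daily input-token spend at a flat per-million rate.
    return tokens_per_day / 1_000_000 * usd_per_million

print(bedrock_daily_cost(100_000_000))  # 100M tokens/day -> 50.0
```

Run your own daily token volume through this before deciding which workloads justify a local Granite deployment.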