
Claude Opus 4.7 just launched, and the short version for builders is this: it’s less about flashy new tricks and more about reliability on hard, long-running work. If Opus 4.6 felt strong but sometimes brittle in multi-step agent flows, 4.7 is aimed directly at that pain.
Anthropic kept pricing the same as Opus 4.6 ($5 per million input tokens, $25 per million output tokens), but changed core behavior in instruction following, multimodal vision, long-horizon execution, and tool-use consistency. That combination matters more than any single benchmark screenshot.
What’s actually different in Claude Opus 4.7
The biggest practical change is that Opus 4.7 appears more dependable when tasks run for a long time and involve multiple tools, files, and intermediate checks.
- More literal instruction following: Anthropic says 4.7 is “substantially better” at following instructions. Translation: prompts written for older Claude models may behave differently now because 4.7 obeys constraints more strictly instead of improvising.
- Better long-run autonomy: Multiple testers report 4.7 keeps going through tool failures and does more self-verification before reporting completion.
- Higher-resolution vision support: 4.7 can process images up to 2,576 pixels on the long edge (~3.75 megapixels), which Anthropic says is over 3x prior Claude models. This is a real capability increase for dense screenshots, charts, technical diagrams, and computer-use workflows.
- New effort level: Anthropic added xhigh effort (between high and max), giving more control over reasoning depth vs latency. In Claude Code, default effort has been raised to xhigh.
- Task budgets in API (public beta): Developers can now guide token spend across longer runs, which is useful for keeping autonomous workflows economically bounded.
- Claude Code upgrades: New /ultrareview command for dedicated bug/design review sessions, plus expanded auto mode for Max users.
- Cyber safeguards: Anthropic says 4.7 includes real-time detection/blocking for prohibited high-risk cyber requests, with a Cyber Verification Program for legitimate security research users.
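To make the new knobs concrete, here is a sketch of what a request using them might look like. This is an assumption-heavy illustration, not documentation: the model id, the `effort` field, and the `task_budget` field are all hypothetical names standing in for whatever Anthropic's actual API reference specifies, so check the docs before wiring anything up.

```python
# Hypothetical request payload for Opus 4.7 with the new controls.
# NOTE: "effort" and "task_budget" are assumed field names based on the
# launch post's description, not confirmed API parameters; the model id
# is likewise a guess. Verify against Anthropic's API reference.
request = {
    "model": "claude-opus-4-7",  # assumed model identifier
    "max_tokens": 4096,
    # New effort level sits between "high" and "max" per the post.
    "effort": "xhigh",
    # Task budgets (public beta): bound total spend across a long run.
    "task_budget": {"max_total_tokens": 500_000},
    "messages": [
        {"role": "user",
         "content": "Refactor the auth module and run the tests."}
    ],
}
```

The useful design point survives even if the field names don't: effort controls reasoning depth per turn, while the task budget bounds the whole run, and long agent loops want both set deliberately.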
Benchmark deltas that actually moved
Anthropic’s post combines internal and partner evals, so treat them as directional unless independently replicated. That said, the deltas are specific enough to show where improvement is concentrated.
- CursorBench: Opus 4.7 at 70% vs Opus 4.6 at 58% (12-point jump).
- BigLaw Bench (Harvey): 90.9% at high effort, with improved handling of ambiguous contract editing tasks.
- Internal research-agent benchmark: tied top overall score at 0.715 across six modules; General Finance module improved to 0.813 vs 0.767 on 4.6.
- Notion multi-step workflow eval: +14% over 4.6, with fewer tokens and roughly one-third of the tool errors.
- Rakuten-SWE-Bench: reported 3x more production tasks resolved vs 4.6, plus double-digit code/test quality gains.
- CodeRabbit code review eval: recall improved by 10%+ while precision remained stable.
- Databricks OfficeQA Pro: 21% fewer source-grounded document reasoning errors than 4.6.
- XBOW visual-acuity benchmark: 98.5% for 4.7 vs 54.5% for 4.6 (very large jump on image acuity-dependent workflows).
- Factory Droids: reported 10–15% task-success lift with fewer tool errors.
The pattern is clear: gains are strongest in agentic coding, tool orchestration, fine-grained visual interpretation, and document-heavy enterprise analysis.
Token and migration reality: better model, different economics
There are two migration gotchas Anthropic explicitly flags.
- Tokenizer update: The same input may map to ~1.0–1.35x more tokens, depending on the content type.
- Higher-effort reasoning: At high/xhigh levels, especially in later turns of agentic runs, output token volume can increase.
So while sticker pricing is unchanged, session-level cost and rate-limit behavior may shift. Anthropic recommends measuring on real traffic, and that's the right move. Don't migrate blindly off marketing charts.
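The session-level math is easy to sanity-check before you migrate. A minimal sketch, using the published $5/$25 per-million pricing and the ~1.0–1.35x tokenizer range from above; the 2M/500k daily volume and the 1.2x output-growth factor are illustrative assumptions, not Anthropic numbers:

```python
# Published Opus pricing: $5 per 1M input tokens, $25 per 1M output tokens.
PRICE_IN = 5.00 / 1_000_000
PRICE_OUT = 25.00 / 1_000_000

def projected_cost(input_tokens, output_tokens,
                   tokenizer_mult=1.0, output_growth=1.0):
    """Project daily spend under the tokenizer and effort changes.

    tokenizer_mult: ~1.0-1.35x per the migration notes.
    output_growth: assumed factor for extra reasoning output at
    high/xhigh effort (illustrative, measure on your own traffic).
    """
    return (input_tokens * tokenizer_mult * PRICE_IN
            + output_tokens * output_growth * PRICE_OUT)

# Illustrative workload: 2M input / 500k output tokens per day.
baseline = projected_cost(2_000_000, 500_000)                 # as priced
worst = projected_cost(2_000_000, 500_000, 1.35, 1.2)          # pessimistic
```

Same sticker price, yet the pessimistic case is ~27% more per day, which is exactly why the "measure on real traffic" advice matters.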
Who should care immediately
If your product depends on reliability under complexity, Opus 4.7 is worth testing now.
- Devtool builders: CI assistants, code review, debugging copilots, autonomous patching, infra orchestration.
- Agent platform teams: Multi-agent systems where loop resistance, role fidelity, and error recovery decide user trust.
- Enterprise doc/research products: Legal, finance, and compliance workflows that need grounded analysis over large source sets.
- Computer-use and UI automation teams: Anyone parsing high-density screenshots or technical visual artifacts.
If your roadmap includes “one human supervising multiple agents in parallel,” this is exactly the model profile you care about.
Who probably shouldn’t rush
Not every app needs Opus 4.7 as default on day one.
- Basic chat apps: If your traffic is short-turn Q&A and simple generation, you may not get proportional value vs cheaper models.
- Teams without eval harnesses: If you can’t measure completion rate, tool error rate, and cost per successful task, you won’t know if you improved anything.
- Ultra-cost-sensitive workloads: Tokenizer/effort changes can erase gains if you optimize for minimal spend over maximal reliability.
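If "teams without eval harnesses" stings, the minimum viable harness is small. A sketch of the three metrics named above (completion rate, tool error rate, cost per successful task); the record shape and sample numbers are mine, not from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    completed: bool      # did the agent finish the task?
    tool_calls: int      # total tool invocations in the run
    tool_errors: int     # tool invocations that failed
    cost_usd: float      # total spend for the run

def summarize(runs):
    """Compute the three metrics you need before/after a model swap."""
    n = len(runs)
    successes = sum(r.completed for r in runs)
    calls = sum(r.tool_calls for r in runs)
    errors = sum(r.tool_errors for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": successes / n,
        "tool_error_rate": errors / calls if calls else 0.0,
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }

# Illustrative data: four runs, three completed.
runs = [
    RunResult(True, 4, 0, 0.30),
    RunResult(True, 3, 1, 0.50),
    RunResult(False, 2, 0, 0.20),
    RunResult(True, 1, 0, 0.20),
]
metrics = summarize(runs)
```

Run the same task set on 4.6 and 4.7 and compare these three numbers; cost per *successful* task is the one that decides whether the tokenizer change actually hurt you.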
What builders should change this week
If you’re migrating from Opus 4.6, this is the practical checklist:
- Re-tune prompts for literal compliance: Remove ambiguous instructions and specify fallback behavior explicitly.
- A/B test effort levels: Compare high vs xhigh by task class, not globally. Some pipelines benefit a lot; others just get slower/more expensive.
- Enable task budgets: Set token ceilings and stop conditions for long agent loops.
- Track production metrics: Completion success, retries, tool-call error rate, user-visible failures, and cost per successful outcome.
- Revisit vision pipelines: Use high-res inputs only when detail matters; downsample when it doesn’t to control token usage.
- Update safety policy: If you have cyber/security use cases, account for new guardrails and verification pathways.
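The vision-pipeline item above reduces to one decision per image: cap the long edge at the model's 2,576-pixel limit when detail matters, and downsample well below it when it doesn't. A minimal sketch of that sizing logic (pure arithmetic; the 1,024-pixel "low detail" cap is an arbitrary example, and your image library's resize call would consume the result):

```python
def fit_long_edge(width, height, max_long_edge=2576):
    """Scale (width, height) so the long edge fits the cap,
    preserving aspect ratio. 2576 is the stated Opus 4.7 limit;
    pass a smaller cap when fine detail isn't needed.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already within budget, no resize
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)

# Dense screenshot where detail matters: cap at the model limit.
hi_detail = fit_long_edge(5152, 2000)          # -> (2576, 1000)
# Thumbnail-grade context image: shrink further to save tokens.
lo_detail = fit_long_edge(5152, 2000, 1024)
```

Feeding every image in at maximum resolution is the easy way to quietly multiply token spend; this keeps the high-res path opt-in.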
Bottom line
Claude Opus 4.7 is a meaningful frontier update for builders, but the win condition is specific: hard, multi-step, tool-using, long-context work where reliability matters more than demo quality. The benchmark movement supports that story, and the feature changes (xhigh effort, higher-res vision, task budgets, /ultrareview) reinforce it.
If your app lives in that zone, test 4.7 immediately and likely upgrade. If your app is mostly lightweight chat, wait for your own data before paying the complexity and potential token-cost tradeoffs. This launch is less “everyone switch now” and more “serious agent builders just got a stronger default.”
Now you know more than 99% of people. — Sara Plaintext