The headline capability shifts that matter for agents
Opus 4.7 is not just “a bit smarter” than its predecessor. For people building AI agents, three specific shifts matter: materially better agentic coding, a vision upgrade that finally makes screenshot-first workflows practical, and /ultrareview as a built-in self-critique loop.
Agentic coding and scaled tool use: where Opus 4.7 actually pulls ahead
Anthropic is explicit: Claude Opus 4.7 beats GPT‑5.4 and Gemini 3.1 Pro on agentic coding, scaled tool use, agentic computer use, and financial analysis benchmarks. That’s a narrow but strategically crucial slice of the LLM landscape — exactly where serious agents live.
The practical consequences:
- Deeper multi-step coding. In multi-file, multi-service repos, Opus 4.7 is better at maintaining a coherent “plan of attack”: e.g., modify a backend service, update a shared library, patch tests, adjust CI config, and update infra as code without losing the thread two tools or three files later.
- More reliable tool orchestration. In loops where the model calls a code runner, a file system, an HTTP client, and a browser automation tool, Opus 4.7 shows a lower error cascade rate — fewer bad assumptions compounding across tools, fewer “I forgot I already updated that file” moments.
- Better financial reasoning. For agents doing forecasting, portfolio rebalancing, or cost-optimization (cloud, marketing, logistics), Opus 4.7 holds intermediate calculations more faithfully and is better at reconciling multiple data sources (e.g., three CSVs + a database extract + a PDF report) into a consistent view.
If your current agents regularly fail on 8–20 step plans — particularly in real codebases or complex workflows — Opus 4.7 is where that failure rate drops enough to matter.
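The reconciliation pattern described above can be sketched as a simple consistency check before the agent trusts its own numbers; the source names, the tolerance, and the figures below are illustrative, not from any Anthropic API.

```python
def reconcile(totals: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Return the sources whose total deviates from the median
    by more than `tolerance` (as a fraction of the median)."""
    values = sorted(totals.values())
    median = values[len(values) // 2]
    return [
        source
        for source, value in totals.items()
        if abs(value - median) > tolerance * abs(median)
    ]

# Three views of "Q3 revenue" pulled from different sources:
totals = {
    "billing_csv": 1_204_330.00,
    "warehouse_db": 1_204_310.50,  # rounding differences pass
    "pdf_report": 1_180_000.00,    # stale snapshot: flagged
}
print(reconcile(totals))  # ['pdf_report'] at the default 1% tolerance
```

An agent that runs a check like this after ingesting each source can escalate (or re-fetch) only when a source is actually out of line, instead of reasoning over silently inconsistent data.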
2,576 px vision: screenshot-driven agents are now credible
Previous Claude models were constrained on image resolution; screenshot agents had to work with aggressively downscaled inputs, losing the fine-grained cues that matter on modern UIs. Opus 4.7 lifts the long-edge limit to 2,576 pixels, roughly 3Ă— the prior Claude cap.
Implications for agentic computer use:
- Full-screen desktop shots. A standard 1440p screenshot, and many 4K desktop screenshots, can now be passed with far less aggressive downscaling. Button labels, dense tables, multi-pane IDEs, and OS chrome survive intact.
- Multi-panel web apps. Tools like Salesforce, Figma, major trading dashboards, and internal admin consoles frequently pack multiple views into a single screen. At 2,576 px, the model can parse small fonts, status badges, error labels, tiny toggles — the things that make or break automation.
- Higher-confidence cursor-level actions. Agents can more reliably disambiguate visually similar buttons, menu items, and text fields. That translates to fewer off-by-one clicks and fewer “click the wrong thing and lose the state” failures.
The net: Opus 4.7 makes it far more realistic to run agents that understand what they’re seeing on a live desktop or complex web UI, not just “guess based on half-legible JPEGs.” If your roadmap includes RPA-like agents or browser workers, this is the first Claude generation that deserves a serious proof-of-concept.
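To see what the new cap buys, here is a quick sketch of the downscale math. The old cap used for comparison is back-derived from the “roughly 3×” claim above, so treat it as an assumption rather than a documented number.

```python
NEW_CAP = 2576
OLD_CAP = 2576 // 3  # ~858 px, an assumed prior cap for comparison only

def scale_factor(width: int, height: int, cap: int) -> float:
    """Factor a screenshot is multiplied by so its long edge fits `cap`
    (1.0 means it is passed through untouched)."""
    long_edge = max(width, height)
    return min(1.0, cap / long_edge)

for w, h in [(2560, 1440), (3840, 2160)]:  # 1440p and 4K desktops
    print(w, h,
          round(scale_factor(w, h, OLD_CAP), 2),
          round(scale_factor(w, h, NEW_CAP), 2))
```

Under these assumptions a 1440p screenshot now fits without any downscaling at all, and a 4K frame keeps roughly two thirds of its linear resolution, which is the difference between legible and illegible button labels.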
/ultrareview: built-in senior reviewer for your agent
Claude Code’s new /ultrareview command simulates a senior human reviewer. Beyond linting or style feedback, it’s tuned to flag:
- Subtle design flaws (poor abstractions, leaky boundaries, untestable seams)
- Logic gaps (edge cases, race conditions, partial updates, error handling holes)
- Hidden security and reliability risks in code changes
For agents, this is a ready-made self-critique loop:
- Agent generates a patch or new module.
- A wrapper step runs /ultrareview over the output and feeds the critique back to Opus 4.7.
- Agent revises based on critique, or routes to a human if the review flags risk.
Instead of building your own complex critique persona or second-model pipeline, you can standardize on ultrareview for high-risk changes (auth logic, payment flows, infra config, security-sensitive routines). This is especially powerful in CI: every agent-authored PR can get an automatic “senior engineer” pass before a human ever sees it.
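A minimal sketch of that loop, assuming `generate` and `review` wrap your model and /ultrareview calls; the risk-flag list and control flow are illustrative, not an Anthropic API.

```python
RISK_FLAGS = ("security", "race condition", "data loss")  # illustrative

def review_loop(generate, review, max_rounds: int = 3):
    """Run an /ultrareview-style critique loop until the review passes,
    a risk flag forces human escalation, or we give up after max_rounds."""
    patch = generate(feedback=None)
    for _ in range(max_rounds):
        critique = review(patch)          # e.g. wraps /ultrareview
        if not critique:                  # clean review: ship it
            return patch, "approved"
        if any(flag in critique.lower() for flag in RISK_FLAGS):
            return patch, "needs_human"   # route high-risk findings out
        patch = generate(feedback=critique)
    return patch, "needs_human"           # still failing: escalate

# Toy stand-ins showing the loop converge after one revision:
drafts = iter(["v1", "v2"])
patch, status = review_loop(
    generate=lambda feedback: next(drafts),
    review=lambda p: "" if p == "v2" else "missing error handling",
)
print(patch, status)  # v2 approved
```

The important design choice is that risk flags short-circuit to a human rather than looping: a review that mentions a security issue should never be "fixed" autonomously and merged.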
The cost calculus changes: tokenizer 1.0–1.35× and budgets
Anthropic kept Opus 4.7 pricing flat at $5 / $25 per million tokens (input / output) with model ID claude-opus-4-7. But two properties now matter for your budget models:
- Tokenizer inflation. Input token counts are now 1.0–1.35× those of 4.6 for the same raw text.
- “Thinks harder” at higher effort levels. At high and xhigh, Opus 4.7 tends to generate longer chains of thought and more detailed outputs, which means higher output token usage.
Re-benchmark your effective costs, not list prices
If you simply swap claude-opus-4-6 for claude-opus-4-7 and keep everything else constant, your cost-per-call will not stay flat:
- Input inflation. For a typical agent prompt of 8–15k tokens, a 20–35% tokenization bump translates directly to 20–35% higher input cost at the same price per million tokens.
- Output stretch. If your agents are running at high or xhigh effort, expect 10–30% more output tokens on complex tasks because the model is encouraged to reason more explicitly.
For a long-running agent loop (say 30–100 calls per task) that previously cost you $0.40 per end-to-end run, expect that figure to climb into the $0.55–$0.70 band unless you actively manage tokens and effort levels.
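A quick sketch of that arithmetic. Only the $5/$25 list prices and the inflation ranges come from above; the call count and per-call token figures are illustrative, and real bills also depend on things like prompt caching that this ignores.

```python
PRICE_IN, PRICE_OUT = 5.0, 25.0  # $ per million tokens (input / output)

def run_cost(calls, tokens_in, tokens_out, in_mult=1.0, out_mult=1.0):
    """End-to-end cost of an agent run, with optional tokenizer and
    output-length multipliers applied."""
    cost_in = calls * tokens_in * in_mult * PRICE_IN / 1e6
    cost_out = calls * tokens_out * out_mult * PRICE_OUT / 1e6
    return cost_in + cost_out

# A 30-call run that costs ~$0.41 at Opus 4.6 token counts:
base = run_cost(30, tokens_in=2_000, tokens_out=150)
# Same run with 35% input inflation and 30% longer outputs at high effort:
new = run_cost(30, 2_000, 150, in_mult=1.35, out_mult=1.30)
print(round(base, 2), round(new, 2))  # 0.41 0.55
```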
CTO takeaway: treat Opus 4.7 as a new economic regime, even at the same list price. You need fresh telemetry: median tokens in/out per call, per effort level, per task type, and per agent.
The xhigh effort level — a new tool for cost/quality control on agent loops
Anthropic now exposes a new xhigh effort level between high and max. Conceptually:
- normal: Lowest latency, least reasoning overhead.
- high: More careful reasoning, modest cost/latency premium.
- xhigh: Noticeably deeper chain-of-thought, but still practical for interactive loops.
- max: Anthropic’s “no-holds-barred” depth; expensive and slower, reserved for especially hard tasks.
For agent builders, xhigh enables nuanced control of the cost/quality frontier.
Pattern 1: hierarchical effort routing
- normal for cheap, high-frequency operations:
  - Parsing tool outputs
  - Simple transformations (reformat JSON, rewrite text snippets)
  - Routine browser steps (“click next”, “save and close”)
- high for moderate reasoning:
  - Summarizing one or two docs
  - Basic code edits within a single function or file
  - Simple financial calculations on limited data
- xhigh for core “hard steps”:
  - Planning multi-step agent strategies
  - Refactoring multi-file components
  - Reconciling conflicting data sources
  - Security-sensitive changes with ultrareview in the loop
- max only for explicitly whitelisted tasks:
  - Critical incident analysis
  - High-stakes financial or compliance decisions (with human review)
In other words, xhigh becomes the default setting for the 5–20% of calls where marginal quality matters most — without paying the full “max” tax everywhere.
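One way to encode the hierarchy is a plain routing table; the task names and the max whitelist below are illustrative, and the returned effort string is whatever your request layer expects.

```python
# Illustrative task -> effort mapping; extend to your own task taxonomy.
EFFORT_BY_TASK = {
    "parse_tool_output":   "normal",
    "browser_step":        "normal",
    "summarize_doc":       "high",
    "single_file_edit":    "high",
    "plan_strategy":       "xhigh",
    "multi_file_refactor": "xhigh",
    "reconcile_sources":   "xhigh",
}
# max is opt-in only, never reachable by default:
MAX_WHITELIST = {"incident_analysis", "compliance_decision"}

def effort_for(task: str) -> str:
    if task in MAX_WHITELIST:
        return "max"
    return EFFORT_BY_TASK.get(task, "high")  # safe default for unknowns

print(effort_for("plan_strategy"), effort_for("incident_analysis"))
# xhigh max
```

Keeping the table in config (rather than scattered through agent code) also gives you one place to re-tune when your telemetry shows a task class over- or under-spending.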
Pattern 2: dynamic escalation inside loops
Effective agents can now adapt effort per step:
- Start at normal. If the model expresses uncertainty (“I’m not sure…”, “There are several conflicting…”) or fails a simple self-check, escalate.
- Retry at high. If internal tests or tools still flag issues (e.g., failing test suite, inconsistent sums), escalate further.
- Invoke xhigh for the “stuck” step, then drop back to normal/high for the rest.
This approach lets you concentrate your token budget where the agent is actually struggling, not where it’s cruising.
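The escalation ladder reduces to a few lines of control flow; here `attempt` and `passed` stand in for your model call at a given effort and your self-check (test suite, sum reconciliation, schema validation), so the sketch carries no real API surface.

```python
LADDER = ["normal", "high", "xhigh"]  # max stays opt-in, off the ladder

def solve_step(attempt, passed):
    """Try one agent step at increasing effort levels.
    Returns (result, effort_used); callers decide what to do if even
    the top of the ladder fails its check."""
    for effort in LADDER:
        result = attempt(effort)
        if passed(result):
            return result, effort
    return result, LADDER[-1]  # exhausted the ladder

# Toy stand-in: the step only succeeds once the model "thinks harder".
result, effort = solve_step(
    attempt=lambda e: f"answer@{e}",
    passed=lambda r: r.endswith("xhigh"),
)
print(result, effort)  # answer@xhigh xhigh
```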
The cybersecurity safeguards — implications for red-team and pentest builders
Opus 4.7 is Anthropic’s first model with automated systems to detect and block prohibited cybersecurity requests. It is also paired with a Cyber Verification Program that grants enhanced access to vetted pentesters and vulnerability researchers.
What changes in practice
- General users:
  - More aggressive refusal behavior for exploit generation, payload crafting, and step-by-step intrusion guidance.
  - Higher likelihood that “find vulnerabilities in X and craft an exploit” styles of prompt simply fail, even if your use case is legitimate.
- Verified security teams:
  - Through the Cyber Verification Program, you can get access pathways where Opus 4.7 is allowed to provide more detailed security reasoning.
  - Anthropic is explicit that this is aimed at legitimate defense: pentest firms, internal red teams, and open-source security tools.
The context here is Anthropic’s unreleased Mythos model: Anthropic reports it has already found thousands of zero-days across “every major OS and web browser,” and judges it capable of “hacking major banking systems if misused.” That capability is precisely why Opus 4.7 is getting defensive guardrails now.
If you’re building security agents
You should assume:
- “Off-the-shelf” Opus 4.7 in regular API access will be more constrained than GPT‑5.4 and Gemini 3.1 Pro on offensive security details.
- To build agents that perform end-to-end red-team workflows (scanning, exploitation, lateral movement simulation), you will likely need:
  - Enrollment in Anthropic’s Cyber Verification Program.
  - Clear logging, scoping, and access controls around your use of the model.
- Your prompts and system design should be unambiguous about defensive intent: “generate test payloads for this controlled lab environment,” “analyze this binary captured from our own systems,” etc.
The trade-off for you as a builder: Opus 4.7 will be less “open ended” for ad hoc hacking questions, but — once verified — you’re plugging into an ecosystem whose frontier model (Mythos) has demonstrably high security intelligence. The upside is better defensive automation; the downside is stricter gates and more compliance overhead.
The Mythos question: Anthropic’s frontier model and strategic signaling
Anthropic has publicly conceded that Claude Opus 4.7 “does NOT match” their internal Mythos model. That admission, plus the Mythos Preview blog at red.anthropic.com/2026/mythos-preview/, reshapes how CTOs should think about vendor roadmaps.
Known facts about Mythos:
- It is Anthropic’s current frontier model, more capable than Opus 4.7.
- It is “strikingly capable” at computer security tasks.
- Anthropic has used Mythos to discover thousands of zero-day vulnerabilities across every major OS and web browser.
- They judge it capable of hacking major banking systems if misused.
- It is not publicly released due to safety concerns.
- It is accessible only via “Project Glasswing” to critical industry partners and select open-source defenders.
- US administration officials were briefed prior to launch; mainstream outlets from Bloomberg to Scientific American have framed it as “too dangerous for release.”
Strategic implications for builders
- The frontier line is now explicitly gated. The best model Anthropic has is no longer the one you can simply buy. Your agents are, by design, one notch below their internal capability frontier.
- Security is the first domain where this matters. If your business lives in critical infrastructure, finance, or cloud security, the competitive edge may come from access to Mythos via Project Glasswing, not from using the same public model everyone else can call.
- Expect the pattern to spread. As GPT‑class and Gemini‑class models also push into dangerous territories (bio, cyber, disinformation), you should plan for a world where high-end capabilities are segmented:
  - Public “general” models (like Opus 4.7)
  - Partner-only frontier tiers (like Mythos)
  - Domain-specific “red” endpoints under strict compliance
For your architecture, the key is decoupling: write your agent frameworks so that swapping “Opus 4.7 → Mythos” (or equivalent future frontier models) is a configuration change, not a rewrite. That way, if and when you qualify for access, your core orchestration, observability, and guardrails remain the same.
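A minimal sketch of that decoupling, with the model ID and effort read from configuration so orchestration code never hard-codes a model string; the environment variable names and every model ID other than claude-opus-4-7 are hypothetical placeholders.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    model_id: str
    effort: str = "high"

def load_config() -> ModelConfig:
    """Resolve the model from the environment. Orchestration, observability,
    and guardrail code take a ModelConfig, so swapping frontier models
    is a deployment change, not a rewrite."""
    return ModelConfig(
        model_id=os.environ.get("AGENT_MODEL", "claude-opus-4-7"),
        effort=os.environ.get("AGENT_EFFORT", "high"),
    )

cfg = load_config()
print(cfg.model_id)  # claude-opus-4-7 unless AGENT_MODEL overrides it
```

The same pattern extends naturally to per-task overrides (e.g., a partner-tier model only for whitelisted security workloads) without touching the agent loop itself.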
Practical recommendations: migration, Sonnet, and where to wait
Given all of this, how should you actually reshuffle your agents across the Claude family?
Agents to migrate to Opus 4.7 now
- Complex coding agents where:
  - Tasks span multiple services, repos, or tech stacks.
  - Error rates today come from brittle long-range reasoning, not just syntax mistakes.
  - You can amortize a 20–40% token cost increase over expensive human time.
- Screenshot-driven or desktop/browser agents that:
  - Operate on dense dashboards or multi-panel UIs.
  - Need fine-grained UI understanding (small fonts, color-coded states, nested modals).
- Financial analysis and planning agents doing:
  - Budget forecasting and scenario analysis for large orgs.
  - Portfolio or treasury modeling where analytic mistakes are expensive.
- High-stakes content agents (e.g., compliance summaries, legal-ish analyses) where extra reasoning depth and ultrareview-style critique materially reduce human review time.
Agents to leave on Sonnet (or equivalent mid-tier) for now
- High-volume, low-stakes operations: tagging, routing, FAQ answering, email triage, simple customer support flows.
- Simple code gen helpers: one-file scripts, boilerplate scaffolding, test stub generation, documentation updates.
- Lightweight data wrangling: CSV reshaping, basic summarization, routine business intelligence queries where human verification is already cheap.
In these domains, the incremental quality from Opus 4.7 rarely justifies the cost and token inflation.
Agents to “wait on” pending more data
- Autonomous long-horizon agents. If you’re experimenting with agents that run unsupervised for hours or days, the combination of higher tokenization and longer outputs can create runaway bills. Keep them on cheaper models until you have tight usage caps and xhigh/max routing well tuned.
- Security agents. Until your org is enrolled in the Cyber Verification Program and you’ve tested refusal patterns, do not anchor your roadmaps on Opus 4.7 as the core “brains” for exploit generation or red teaming.
- Domain-specialized agents with external heuristics. If your value comes largely from external solvers (mathematical engines, constraint solvers, code execution) and the LLM acts mostly as glue, the jump to Opus 4.7 may deliver limited marginal benefit; run small A/B evaluations first.
The competitive landscape: GPT‑5.4 vs Opus 4.7 vs Gemini 3.1 Pro vs (eventually) Mythos
At a high level, here’s how the current generation of flagship models lines up for agent builders.
Claude Opus 4.7
- Strengths:
  - Best-in-class agentic coding and scaled tool use (per Anthropic’s own benchmarks vs GPT‑5.4 and Gemini 3.1 Pro).
  - 2,576 px vision enabling credible screenshot / desktop agents.
  - /ultrareview offering native self-critique for code and design.
  - xhigh effort level and “think harder” behavior tuning for agent loops.
  - Strong availability footprint: Amazon Bedrock, Google Vertex AI, Microsoft Foundry.
- Trade-offs:
  - Token inflation (1.0–1.35×) vs Opus 4.6, with potentially higher output lengths.
  - Stricter cybersecurity safeguards out of the box.