Anthropic Releases Claude Opus 4.7: The Most Capable Public AI Model — With a Catch – AAXYZ

On April 16, 2026, Anthropic announced the release of Claude Opus 4.7, positioning it as the company’s most capable publicly available AI model to date. But there’s a twist: Anthropic is openly acknowledging that Opus 4.7 is deliberately less powerful than its internal Claude Mythos Preview model — and that gap, they say, is by design.

The release marks a significant milestone in the AI landscape. Opus 4.7 is immediately available across all Claude products, the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry — all at the same pricing as Opus 4.6 ($5 per million input tokens, $25 per million output tokens). But beneath the surface, this model represents a fundamentally new approach to balancing capability with safety.

What Makes Opus 4.7 a “Notable Improvement”

Anthropic’s own assessment is that Opus 4.7 is “a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks.” Early-access testers echo this sentiment with specific, measurable results:

SWE-bench Pro: 64.3%, a 10.9-point jump from Opus 4.6’s 53.4% and ahead of GPT-5.4’s 57.7%. This is among the largest coding gains in the release.
SWE-bench Verified: 87.6% — up from 80.8% for Opus 4.6, approaching the 93.9% ceiling set by Mythos Preview.
CursorBench: 70%, a substantial leap from Opus 4.6’s 58%, representing a 12-point gain on Cursor’s 93-task coding benchmark.
OSWorld-Verified: 78.0%, ahead of GPT-5.4 (75.0%) and up 5.3 points from Opus 4.6 (72.7%).
Biology (GPQA subset): 74.0%, a 43.1-point gain over Opus 4.6’s 30.9%, or a relative improvement of roughly 139%.
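Point gains and relative gains are easy to conflate, so it can help to recompute both from the scores listed above. The following is a minimal Python sketch; the `SCORES` table and the `gains` helper are illustrative names, not anything published by Anthropic.

```python
# Scores in percent, copied from the list above: (Opus 4.6, Opus 4.7).
SCORES = {
    "SWE-bench Pro": (53.4, 64.3),
    "SWE-bench Verified": (80.8, 87.6),
    "CursorBench": (58.0, 70.0),
    "OSWorld-Verified": (72.7, 78.0),
    "Biology (GPQA subset)": (30.9, 74.0),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100.0
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        points, percent = gains(old, new)
        print(f"{name}: +{points} points, +{percent}% relative")
```

A benchmark that starts from a low baseline, such as the biology subset, shows a far larger relative gain than point gain, which is why the two figures should not be compared across benchmarks directly.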

Caitlin Colgrove, Co-Founder and CTO of Hex, summarized the upgrade succinctly: “Claude Opus 4.7 is the strongest model Hex has evaluated. It correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and it resists dissonant-data traps that even Opus 4.6 falls for. It’s a more intelligent, more efficient Opus 4.6: low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6.”

The Vision Breakthrough: 3x Higher Resolution

One of the most tangible upgrades is in visual processing. Opus 4.7 can accept images up to 2,576 pixels on the long edge — approximately 3.75 megapixels, more than triple the resolution of prior Claude models. This isn’t an API parameter change; it’s a fundamental model-level improvement.
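For applications that prepare images before sending them, the long-edge limit translates into a simple resize calculation. The sketch below is an assumption about how a client might pre-size images; the article does not say how oversized images are handled on the server side, and `fit_long_edge` and `megapixels` are hypothetical helper names.

```python
MAX_LONG_EDGE = 2576  # long-edge limit in pixels, as quoted in the article


def fit_long_edge(width: int, height: int,
                  max_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_edge.

    The aspect ratio is preserved, and images already within the limit
    are returned unchanged (this helper never upscales).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


def megapixels(width: int, height: int) -> float:
    """Pixel count in millions."""
    return width * height / 1_000_000


if __name__ == "__main__":
    w, h = fit_long_edge(3840, 2160)  # a 4K screenshot
    print(w, h, round(megapixels(w, h), 2))
```

The total pixel count at the limit depends on aspect ratio: a 16:9 screenshot scaled this way comes out at 2576 x 1449, about 3.73 megapixels, close to the 3.75 figure quoted above.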

The practical implications are significant:

Computer-use agents can now read dense screenshots with far greater accuracy
Data extraction from complex diagrams and charts is substantially improved
Pixel-perfect reference work — matching design mockups to code — becomes more reliable

Oege de Moor, CEO of XBOW, reported dramatic results: “For the computer-use work that sits at the heart of XBOW’s autonomous penetration testing, the new Claude Opus 4.7 is a step change: 98.5% on our visual-acuity benchmark versus 54.5% for Opus 4.6.” That is roughly a 1.8x increase in accuracy on visual tasks critical to autonomous security operations.

Long-Horizon Autonomy: AI That Doesn’t Give Up

Perhaps the most consequential improvement is in sustained, autonomous reasoning. Opus 4.7 is optimized to work coherently for hours, pushing through difficult problems rather than abandoning them — a limitation that has plagued even the most advanced AI models in production settings.

Scott Wu, CEO of Cognition, the company behind Devin, described the impact: “Claude Opus 4.7 takes long-horizon autonomy to a new level in Devin. It works coherently for hours, pushes through hard problems rather than giving up, and unlocks a class of deep investigation work we couldn’t reliably run before.”

The model also devises ways to verify its own outputs before reporting back — a form of self-correction that reduces the need for human oversight. Sean Ward, CEO and Co-Founder of an AI engineering firm, shared a striking example: “Claude Opus 4.7 autonomously built a complete Rust text-to-speech engine from scratch — neural model, SIMD kernels, browser demo — then fed its own output through a speech recognizer to verify it matched the Python reference. Months of senior engineering, delivered autonomously.”

The Mythos Preview Gap: Why Less Is More

Here’s where things get interesting. Anthropic’s internal Claude Mythos Preview model outperforms Opus 4.7 on every benchmark that both models have been tested on:

Mythos Preview scores 77.8% on SWE-bench Pro compared to Opus 4.7’s 64.3%, 93.9% on SWE-bench Verified versus 87.6%, and 79.6% on OSWorld-Verified versus 78.0%. It also has more advanced cyber capabilities and is, by Anthropic’s own assessment, “the best-aligned model” the company has ever trained.

So why release a deliberately weaker model? The answer lies in Project Glasswing, Anthropic’s initiative to develop and test cyber safeguards before releasing more powerful models to the public.

As Anthropic explains: “Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities).”

Opus 4.7 serves as a test vehicle — a model powerful enough to be useful but limited enough to serve as a proving ground for safety mechanisms. “What we learn from the real-world deployment of these safeguards,” Anthropic states, “will help us work towards our eventual goal of a broad release of Mythos-class models.”

This approach represents a significant shift in AI development philosophy: rather than racing to release the most capable model, Anthropic is deliberately pacing its releases to validate safety measures at each step.

New Agent Features: xhigh, Task Budgets, and Auto Mode

Alongside the model itself, Anthropic introduced several new features designed specifically for autonomous agent workflows:

New “xhigh” effort level: A new tier between “high” and “max” that offers finer control over the reasoning-versus-latency tradeoff. Claude Code’s default has been raised to xhigh across all plans.
Task Budgets (API, public beta): Developers can now cap Claude’s token spend across a run, guiding the model to prioritize work efficiently over longer sessions.
Auto Mode (Max users in Claude Code): A new permissions option where Claude makes decisions on the user’s behalf — a middle ground between approving every step and skipping permissions entirely.
/ultrareview slash command: Produces a dedicated review session that reads through code changes and flags bugs and design issues. Pro and Max users receive three free ultrareviews.
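The article does not describe the request format for Task Budgets, so the sketch below does not call that feature. Instead it illustrates the underlying idea of capping token spend across a run with a purely client-side counter. `TokenBudget` and `run_steps` are hypothetical names, and the per-step usage numbers stand in for whatever token counts an application records after each model call.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical client-side token budget for one agent run."""
    limit: int
    spent: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one step's token usage to the running total."""
        if input_tokens < 0 or output_tokens < 0:
            raise ValueError("token counts must be non-negative")
        self.spent += input_tokens + output_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.spent)

    @property
    def exhausted(self) -> bool:
        return self.spent >= self.limit


def run_steps(budget: TokenBudget,
              step_usages: list[tuple[int, int]]) -> int:
    """Consume (input, output) usages step by step until the budget runs out.

    Returns the number of steps that were executed.
    """
    executed = 0
    for input_tokens, output_tokens in step_usages:
        if budget.exhausted:
            break
        budget.record(input_tokens, output_tokens)
        executed += 1
    return executed


if __name__ == "__main__":
    budget = TokenBudget(limit=10_000)
    steps = [(3000, 1000), (4000, 2000), (500, 500)]
    print(run_steps(budget, steps), budget.remaining)
```

A counter like this can only stop a loop after the fact. As described above, the API feature is meant to guide the model itself to prioritize work within the cap, which a client-side check cannot do.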

Sarah Sachs, AI Lead at Notion, highlighted the practical impact: “For complex multi-step workflows, Claude Opus 4.7 is a clear step up: plus 14% over Opus 4.6 at fewer tokens and a third of the tool errors. It’s the first model to pass our implicit-need tests, and it keeps executing through tool failures that used to stop Opus cold.”

Safety: A Balanced Assessment

Anthropic’s safety evaluation of Opus 4.7 reveals a nuanced picture. Overall, the model maintains a safety profile similar to Opus 4.6, with low rates of concerning behaviors such as deception, sycophancy, and cooperation with misuse.

However, there are specific improvements and trade-offs:

Improved: Honesty and resistance to malicious “prompt injection” attacks
Weakened: A modestly increased tendency to give overly detailed harm-reduction advice on controlled substances
Cyber safeguards: The model automatically detects and blocks requests indicating prohibited or high-risk cybersecurity uses

Anthropic’s alignment assessment concludes that Opus 4.7 is “largely well-aligned and trustworthy, though not fully ideal in its behavior.” For legitimate cybersecurity professionals, the company offers a Cyber Verification Program that provides access to more capable cyber features for vulnerability research, penetration testing, and red-teaming purposes.

Pricing and Token Economics

Opus 4.7 maintains the same pricing as its predecessor: $5 per million input tokens and $25 per million output tokens. Assuming Sonnet’s published rates of $3 and $15 per million tokens, that makes it about 1.7x the per-token price of Sonnet. The context window remains at approximately 1 million tokens.

However, there’s an important migration consideration: Opus 4.7 uses an updated tokenizer that can map the same input to 1.0–1.35x more tokens depending on content type. Anthropic reports that despite this, net token usage across all effort levels is improved on internal coding evaluations, as the model thinks more efficiently at higher effort levels.
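To see what the tokenizer change could mean for a bill, the quoted prices and the 1.0–1.35x range can be combined into a rough cost bracket. This is a sketch under stated assumptions: it applies the same multiplier to input and output tokens, and it ignores the efficiency gains Anthropic reports, so the upper figure is a worst case for the tokenizer effect alone. The function names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens (from the article)
TOKENIZER_RANGE = (1.0, 1.35)  # token-count multiplier range quoted above


def cost_usd(input_tokens: float, output_tokens: float) -> float:
    """Cost in USD for the given token counts at the quoted prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def cost_range_after_migration(old_input_tokens: int,
                               old_output_tokens: int) -> tuple[float, float]:
    """Return (low, high) cost if token counts grow by the quoted range.

    Assumption: the same multiplier applies to input and output tokens,
    and the model produces the same amount of text as before.
    """
    low_mult, high_mult = TOKENIZER_RANGE
    low = cost_usd(old_input_tokens * low_mult, old_output_tokens * low_mult)
    high = cost_usd(old_input_tokens * high_mult, old_output_tokens * high_mult)
    return round(low, 2), round(high, 2)


if __name__ == "__main__":
    # A workload that used 1M input and 200K output tokens under the old tokenizer.
    print(cost_range_after_migration(1_000_000, 200_000))
```

Because per-token prices are unchanged, any real cost difference comes from token counts, so measuring actual usage on a representative workload is more reliable than this bracket.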

Michele Catasta, President of Replit, noted the efficiency gains: “For the work our users do every day, we observed it achieving the same quality at lower cost — more efficient and precise at tasks like analyzing logs and traces, finding bugs, and proposing fixes.”

How Opus 4.7 Compares to the Competition

The AI model landscape in April 2026 is fiercely competitive. Here’s how Opus 4.7 stacks up against key rivals:

vs. GPT-5.4: Opus 4.7 leads on SWE-bench Pro (64.3% vs. 57.7%), OSWorld-Verified (78.0% vs. 75.0%), and Finance Agent v1.1 (64.4% vs. 61.5%). GPT-5.4 leads on BrowseComp (89.3% vs. 79.3%) and Terminal-Bench 2.0 (75.1% vs. 69.4%).
vs. Gemini 3.1 Pro: Opus 4.7 leads on SWE-bench Pro (64.3% vs. 54.2%), SWE-bench Verified (87.6% vs. 80.6%), and Finance Agent v1.1 (64.4% vs. 59.7%).
vs. Muse Spark: Muse Spark slightly edges Opus 4.7 on MCP-Atlas (78.3% vs. 77.3%), though Opus 4.7 leads on most other benchmarks.
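The head-to-head figures above can be tabulated to show how many of the cited benchmarks each side leads and by what margin. A minimal sketch follows; the `COMPARISONS` rows are copied from the list above, and the helper names are illustrative.

```python
# (benchmark, rival, Opus 4.7 score, rival score), in percent, as quoted above.
COMPARISONS = [
    ("SWE-bench Pro", "GPT-5.4", 64.3, 57.7),
    ("OSWorld-Verified", "GPT-5.4", 78.0, 75.0),
    ("Finance Agent v1.1", "GPT-5.4", 64.4, 61.5),
    ("BrowseComp", "GPT-5.4", 79.3, 89.3),
    ("Terminal-Bench 2.0", "GPT-5.4", 69.4, 75.1),
    ("SWE-bench Pro", "Gemini 3.1 Pro", 64.3, 54.2),
    ("SWE-bench Verified", "Gemini 3.1 Pro", 87.6, 80.6),
    ("Finance Agent v1.1", "Gemini 3.1 Pro", 64.4, 59.7),
    ("MCP-Atlas", "Muse Spark", 77.3, 78.3),
]


def margin(opus: float, rival: float) -> float:
    """Opus 4.7's lead in percentage points (negative if the rival leads)."""
    return round(opus - rival, 1)


def tally(rows: list[tuple[str, str, float, float]]) -> dict[str, tuple[int, int]]:
    """Per rival: (benchmarks Opus 4.7 leads, benchmarks the rival leads)."""
    result: dict[str, tuple[int, int]] = {}
    for _benchmark, rival, opus, other in rows:
        opus_wins, rival_wins = result.get(rival, (0, 0))
        if opus > other:
            opus_wins += 1
        elif other > opus:
            rival_wins += 1
        result[rival] = (opus_wins, rival_wins)
    return result


if __name__ == "__main__":
    for benchmark, rival, opus, other in COMPARISONS:
        print(f"{benchmark} vs. {rival}: {margin(opus, other):+.1f} points")
    print(tally(COMPARISONS))
```

The tally only covers the benchmarks cited in this article, so it says nothing about benchmarks where no rival score is given.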

The model also scored 54.7% on Humanity’s Last Exam (both with and without tools), placing it in a competitive range with the best publicly available models on this comprehensive knowledge benchmark.

Practical Recommendations for Developers

If you’re considering adopting Opus 4.7, here are key action items:

Retune your prompts. Opus 4.7 follows instructions more literally than previous models. Prompts written for earlier versions may produce unexpected results because Opus 4.7 won’t interpret them loosely or skip parts.
Start with high or xhigh effort. Anthropic recommends these levels for coding and agentic tasks. The new xhigh tier offers a sweet spot between reasoning quality and latency.
Use Task Budgets for long runs. The new API feature lets you control token spend across extended sessions, preventing runaway costs during autonomous work.
Leverage the higher resolution vision. If your application processes screenshots, diagrams, or dense visual content, the 3.75 MP support is a significant upgrade.
Test the /ultrareview command. If you’re a Claude Code user, the dedicated review session can catch bugs and design issues that might otherwise slip through.

Kay Zhu, Co-Founder and CTO of Genspark, emphasized one critical production concern: “For Genspark’s Super Agent, Claude Opus 4.7 nails the three production differentiators that matter most: loop resistance, consistency, and graceful error recovery. Loop resistance is the most critical. A model that loops indefinitely on 1 in 18 queries wastes compute and blocks users.”

The Bottom Line

Claude Opus 4.7 represents a thoughtful, measured step forward in AI capability, one that prioritizes real-world reliability and safety alongside raw performance. It outperforms its predecessor on every benchmark cited here, leads several competitive comparisons, and introduces genuinely new capabilities in vision, autonomy, and agent workflows.

But its most significant feature may be what it doesn’t do: Opus 4.7 is not the most capable model Anthropic has built. That distinction belongs to Mythos Preview, which remains in limited internal use. Opus 4.7 is the bridge — a model powerful enough to deliver real value while serving as a testing ground for the safety systems that will eventually govern the release of even more capable AI.

For developers and organizations looking for the best publicly available AI model today, Opus 4.7 is a compelling choice. For the industry at large, it signals a maturing approach to AI development — one that recognizes that capability without guardrails is not progress, but risk.

“Personally, I love how it pushes back during technical discussions to help me make better decisions. It really feels like a better coworker.” — Michele Catasta, President, Replit

Whether you’re building autonomous coding agents, processing complex visual data, or simply looking for a more reliable AI assistant, Claude Opus 4.7 delivers a meaningful upgrade — with the reassuring knowledge that safety considerations shaped its design at every level.
