Claude Opus 4.6 Surpasses GPT-5.2 in Logic Tests

Anthropic's Claude Opus 4.6 outperforms GPT-5.2 in logic tests and introduces collaborative agent teams, enhancing its functionality for complex tasks.

The AI startup Anthropic has updated its flagship model, Claude Opus, to version 4.6. The model is better at planning actions, sustaining long tasks, and working with large codebases.

The context window has been expanded to 1 million tokens, allowing for the analysis of massive documents and extended dialogues without losing logical coherence.

The updated algorithms are tailored for practical tasks such as financial analysis, research, and the creation of documents, spreadsheets, and presentations.

Opus 4.6 received the highest score in the Terminal-Bench 2.0 programming test and outperformed competitors in the challenging interdisciplinary benchmark, Humanity’s Last Exam, which assesses logical thinking.

In GDPval-AA, which evaluates reasoning and decision-making quality, the model surpassed OpenAI's GPT-5.2. The LLM also excelled in BrowseComp, which measures the ability to find hard-to-access information online.

Opus 4.6 extracts data effectively from long documents. Thanks to the expanded context window, the model can pick up subtle details buried deep in the text.

Agent Teams

A key innovation is the ability to create groups of agents for collaborative work. In this mode, multiple AI assistants operate in parallel and autonomously coordinate their efforts.

This tool is suitable for tasks that can be divided into independent parts and require analysis of large amounts of text.
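
The divide-and-combine pattern described above can be illustrated with a minimal Python sketch. This is not Anthropic's implementation or API: run_agent, split_into_parts, and run_team are hypothetical names, and the "agent" here only counts words so that the example runs on its own.

```python
import asyncio


async def run_agent(agent_id: int, chunk: str) -> dict:
    """Hypothetical worker standing in for one AI assistant (assumption)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"agent": agent_id, "words": len(chunk.split())}


def split_into_parts(text: str, parts: int) -> list[str]:
    """Divide the text into roughly equal, independent chunks of words."""
    words = text.split()
    size = max(1, -(-len(words) // parts))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


async def run_team(text: str, parts: int = 3) -> int:
    """Fan the chunks out to parallel workers, then combine their results."""
    chunks = split_into_parts(text, parts)
    results = await asyncio.gather(
        *(run_agent(i, chunk) for i, chunk in enumerate(chunks))
    )
    return sum(r["words"] for r in results)


def analyze(text: str, parts: int = 3) -> int:
    """Synchronous entry point for the sketch."""
    return asyncio.run(run_team(text, parts))
```

The sketch only works because the chunks do not depend on one another, which is the same condition the article names for agent teams.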

Closed Loop

Anthropic stated, "We are creating Claude together with Claude." Developers write code using their own AI model, and each new product undergoes testing on internal company tasks before release.

The team found that Opus 4.6 pays more attention to the most complex parts of a task without additional instructions, quickly completes simple assignments, handles ambiguous problems better, and maintains efficiency over long durations.

“Opus 4.6 often thinks more deeply and carefully reviews its reasoning before making a decision. This leads to better outcomes in complex cases, but may increase cost and latency in simpler ones,” the company noted.

Safety

An automated audit revealed that Opus 4.6 has a low propensity for undesirable behaviors such as deception, flattery, reinforcing user misconceptions, and facilitating wrongful actions.

The model demonstrates safety levels comparable to Opus 4.5.

To validate the model, the company ran its most comprehensive series of evaluations to date, applying new testing methodologies for the first time and improving existing ones.

Availability and New Features

Claude Opus 4.6 is now available through a web interface, API, and major cloud platforms.

New features for developers include:

adaptive thinking — the neural network autonomously determines when to engage deep reasoning mode;
effort adjustment — four levels of work intensity are available, from low to maximum;
context compression — the tool automatically summarizes and replaces old context as conversations approach token limits.
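
The idea behind context compression can be shown with a rough Python sketch. It rests on loud assumptions: count_tokens is a crude word count rather than a real tokenizer, summarize is a hypothetical placeholder for a model-written summary, and compact is an illustrative name, not part of any Anthropic API.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per word (assumption)."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Hypothetical summarizer; a real system would ask the model itself."""
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[str], limit: int, keep_recent: int = 2) -> list[str]:
    """Replace older messages with a summary once the history exceeds limit."""
    total = sum(count_tokens(message) for message in history)
    if total <= limit or len(history) <= keep_recent:
        return history  # still within budget, or nothing old enough to fold
    cut = len(history) - keep_recent
    old, recent = history[:cut], history[cut:]
    return [summarize(old)] + recent
```

Keeping the most recent messages verbatim is a design choice in this sketch: the latest turns usually matter most, while older ones can survive as a summary.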

Opus 4.6 works better with office tools like Excel and PowerPoint.

In January, Anthropic CEO Dario Amodei predicted the imminent arrival of AGI and resulting job losses.
