Summary
- A team from Huawei and three academic institutions has introduced Claw-Anything, a new benchmark designed to assess AI agents in personal-assistant roles.
- OpenAI’s GPT-5.5 achieved a mere 34.5% on the pass@1 metric, significantly lower than its performance on other benchmarks, indicating potential flaws in current evaluation methods.
- The researchers also unveiled an automated data pipeline that generated 2,000 training scenarios, and fine-tuning an open-weight model on this data enhanced task success rates by 23.7%.
The pitch for AI personal assistants has stayed the same: give the agent access to your digital life, including emails, calendars, notes, and devices, and it will manage everything for you. Your AI understands what you need, acts on it, and lets you rest.
A research collaboration involving Huawei Technologies, Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences has now developed a benchmark to test the validity of this claim. The findings suggest that it may not hold true.
The Claw-Anything benchmark tests AI agents on three fronts: long-term event streams that simulate more than three months of user activity, an average of 10.1 interdependent backend services per task, and interaction across both command-line Linux and graphical Android environments.
Each task presents an average context of 191,700 words, compared with the 1,700 to 12,000 words typical of most benchmarks. That gap shows how far realistic scenarios are from narrowly scoped standardized tests.
AI's understanding is lacking
Scoring is based on the pass@1 metric, which measures the share of tasks an agent completes successfully on its first attempt, without retries. For instance, a task might require the agent to check a price alert for a product discovered weeks prior, find a related appointment in the user’s calendar, and execute both actions on a mobile device. Another task could involve gathering recent work from notes, email threads, and Slack to create a presentation from scratch.
These are realistic requests that users might make of their assistants, yet AI struggles with them. GPT-5.5, as previously covered by Decrypt, is OpenAI's leading model and is designed for long-term agentic tasks, but it scored only 34.5%.
The Claw-Anything paper states, "Current models remain unreliable even when provided with broader access to the user's digital environment." Several models that performed well on other benchmarks saw their scores decline significantly.
The benchmark also evaluates proactive assistance separately, meaning cases in which the agent identifies a need and acts without an explicit prompt. Most existing benchmarks do not assess this. Claw-Anything does, and it reveals a significant gap: agents scored 25.9% on reactive tasks and only 6.7% on proactive tasks.
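As a rough illustration of the scoring described above, the sketch below computes pass@1 overall and separately for reactive and proactive tasks. The records are made-up placeholders, not data from the paper, and the function and field names are assumptions chosen for this example rather than anything from the Claw-Anything code.

```python
from collections import defaultdict


def pass_at_1(first_attempts):
    """Share of tasks whose single, first attempt succeeded (no retries)."""
    if not first_attempts:
        raise ValueError("need at least one task result")
    return sum(first_attempts) / len(first_attempts)


# Hypothetical (task_type, succeeded_on_first_attempt) records; NOT benchmark data.
records = [
    ("reactive", True), ("reactive", False), ("reactive", False), ("reactive", True),
    ("proactive", False), ("proactive", False), ("proactive", True), ("proactive", False),
]

# Group first-attempt outcomes by task type.
by_type = defaultdict(list)
for task_type, succeeded in records:
    by_type[task_type].append(succeeded)

overall = pass_at_1([succeeded for _, succeeded in records])
per_type = {task_type: pass_at_1(results) for task_type, results in by_type.items()}

print(f"overall pass@1: {overall:.1%}")
for task_type, score in per_type.items():
    print(f"{task_type} pass@1: {score:.1%}")
```

Reporting the two task types separately matters because a single overall figure can hide a weak proactive score behind a stronger reactive one.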
Why traditional benchmarks fall short
The researchers argue that existing benchmarks treat AI agents as mere task solvers in an ideal environment. In contrast, Claw-Anything simulates personal assistants navigating the complexities of real life: irrelevant events, conflicting information, and noise accumulated over months. The agent must work out what is pertinent before it can perform effectively.
The results from the ablation studies highlight the importance of multi-service dependencies. When tools necessary for cross-service tasks were excluded, success rates plummeted to nearly zero, as many tasks require agents to gather information and act across various backends rather than being confined to a single one.
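The logic of that ablation can be pictured with a toy model: if each task depends on a set of services, removing one tool rules out every task that needs it. The task names, service names, and the `solvable` helper below are hypothetical and only illustrate the dependency argument; they are not taken from the benchmark or its code.

```python
def solvable(required_services, available_services):
    """A task is only attemptable if every service it depends on is available."""
    return required_services <= available_services


def feasible_fraction(tasks, available_services):
    """Upper bound on success rate: share of tasks with all dependencies present."""
    feasible = [name for name, needs in tasks.items() if solvable(needs, available_services)]
    return len(feasible) / len(tasks)


# Hypothetical tasks and the services each one touches (illustrative only).
tasks = {
    "price_alert_followup": {"shopping", "calendar"},
    "build_presentation": {"notes", "email", "slack"},
    "reply_to_email": {"email"},
}

all_services = {"shopping", "calendar", "notes", "email", "slack"}
without_email = all_services - {"email"}

print(feasible_fraction(tasks, all_services))
print(feasible_fraction(tasks, without_email))
```

In this toy setup, dropping a single service removes two of the three tasks from reach, which is the same effect the ablation reports at larger scale when most tasks span many backends.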
This issue is not entirely new in AI evaluation. Earlier this year, OpenAI acknowledged that SWE-bench was compromised after scores fell from around 70% to 23% on a more reliable version. That was a matter of data integrity. The current issue raises a more fundamental question: whether benchmarks are assessing the right criteria at all.
On a positive note, the research team released the pipeline that created the benchmark alongside 2,000 training environments. By fine-tuning Qwen3.5-27B on 1,500 successful agent trajectories, they increased the pass@1 score by 23.7%, surpassing several closed-source models in the rankings, including Claude Sonnet.
The researchers point to cross-service coordination as the main challenge remaining for the field. The dataset is available on Hugging Face, and the code can be found on GitHub.
