Cybersecurity firm OpenZeppelin has audited EVMbench, OpenAI's new AI benchmark. Its specialists identified methodological flaws and training data "contamination."

https://t.co/yW00RmRBZQ

— OpenZeppelin (@OpenZeppelin) March 2, 2026

OpenAI, the developer of ChatGPT, launched EVMbench in mid-February in partnership with the investment firm Paradigm to assess how well AI agents can find, fix, and exploit vulnerabilities in smart contracts.

OpenZeppelin specialists welcomed the initiative but decided to hold the benchmark to the same standards as the protocols it aims to protect (including Aave, Lido, and Uniswap).

Key Issues

The main problem lies in the "contamination" of training data. EVMbench is based on a collection of 120 vulnerabilities identified during audits from 2024 to 2025.

However, the leading models tested have knowledge cutoffs as late as August 2025, so reports on these vulnerabilities may already have been in their training data. In that case the models could be "recalling" known findings rather than discovering them. Disabling internet access during the test does not remove this doubt about the experiment's integrity: it remains unclear whether the AI can identify genuinely new threats.
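The contamination concern comes down to a date comparison: a finding published before a model's knowledge cutoff could have been seen in training. The Python sketch below illustrates that check only. The finding IDs, disclosure dates, and the exact cutoff day are hypothetical placeholders, not taken from EVMbench or from OpenZeppelin's audit.

```python
from datetime import date

def split_by_cutoff(findings, knowledge_cutoff):
    """Split findings into those a model could have seen in training
    (disclosed on or before the cutoff) and those it could not."""
    possibly_seen, unseen = [], []
    for finding in findings:
        if finding["disclosed"] <= knowledge_cutoff:
            possibly_seen.append(finding["id"])
        else:
            unseen.append(finding["id"])
    return possibly_seen, unseen

# Hypothetical findings; real benchmark entries would carry audit report dates.
findings = [
    {"id": "finding-a", "disclosed": date(2024, 6, 1)},
    {"id": "finding-b", "disclosed": date(2025, 3, 15)},
    {"id": "finding-c", "disclosed": date(2025, 11, 20)},
]

# Assumed cutoff: the article says "August 2025"; the day is a placeholder.
cutoff = date(2025, 8, 31)

possibly_seen, unseen = split_by_cutoff(findings, cutoff)
print("possibly in training data:", possibly_seen)
print("disclosed after cutoff:", unseen)
```

Only results on the second group would say anything about a model's ability to find vulnerabilities it has never encountered. A disclosure date is also just a proxy, since a finding published before the cutoff may still be absent from a given model's training data.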

OpenZeppelin also pointed out factual errors in the EVMbench dataset. At least four vulnerabilities classified as "high risk" turned out not to be exploitable. Even so, AI agents were scored as correct for "identifying" these issues.
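To see why invalid entries matter, consider how a detection rate changes once they are removed from the ground truth. The sketch below uses a simple "detected / total" metric as an assumption; it is not EVMbench's actual scoring method, and all IDs and counts are hypothetical.

```python
def detection_rate(detected, ground_truth, invalid=frozenset()):
    """Share of valid ground-truth findings that the agent detected.
    Entries listed in `invalid` are excluded from both sides."""
    valid = set(ground_truth) - set(invalid)
    if not valid:
        return 0.0
    return len(set(detected) & valid) / len(valid)

# Hypothetical benchmark with ten findings, two of which do not work.
ground_truth = {f"V{i}" for i in range(1, 11)}
invalid = {"V1", "V2"}

# Hypothetical agent output that includes both invalid findings.
detected = {"V1", "V2", "V3", "V4", "V5"}

raw = detection_rate(detected, ground_truth)
corrected = detection_rate(detected, ground_truth, invalid)
print(f"raw: {raw:.3f}, corrected: {corrected:.3f}")
```

In this toy case the agent's score falls from 0.5 to 0.375 after the invalid findings are excluded, because "matching" a flawed ground-truth entry no longer earns credit.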

"This is not a subjective disagreement about severity; these are cases where the described attack simply does not work," the experts emphasized.

The experts affirmed that artificial intelligence will play a crucial role in the future of blockchain security. However, they cautioned that the rush to deploy it should not come at the expense of data and testing quality.

"The question is not whether AI will change smart contract security—it will. The question is whether the benchmarks and data we build these tools on will meet the same standards as the contracts they are meant to protect," OpenZeppelin concluded.

Earlier, in November, Microsoft experts introduced a testing environment for AI agents and identified vulnerabilities inherent in modern digital assistants.