OpenAI, in collaboration with Paradigm, has introduced EVMbench — a benchmark designed to evaluate the ability of AI agents to identify, fix, and exploit vulnerabilities in smart contracts.
The tool is based on 120 selected vulnerabilities from 40 audits, with most examples sourced from open code analysis platforms. It also includes several attack scenarios from the security assessment of the blockchain Tempo, a specialized Layer 1 network developed by Stripe and Paradigm for high-performance and low-cost payments in stablecoins.
The integration with Tempo has allowed the benchmark to include payment smart contracts, a segment where the use of "stable coins" and AI agents is expected to grow.
“Smart contracts protect crypto assets worth over $100 billion. As AI agents improve in reading, writing, and executing code, it becomes increasingly important to measure their capabilities in real economic conditions and to encourage the use of artificial intelligence for protective purposes — for auditing and strengthening already deployed protocols,” the announcement stated.
To create the testing environment, OpenAI adapted existing exploits and scripts, ensuring their practical applicability.
EVMbench evaluates three capability modes:
- Detect — identifying vulnerabilities;
- Patch — fixing issues;
- Exploit — using vulnerabilities to steal funds.
Performance of AI Models
OpenAI tested advanced models across all three modes. In the Exploit category, the GPT-5.3-Codex model achieved 72.2%, while GPT-5 reached 31.9%. However, the detection and patching performance was more modest, as many issues remain difficult to identify and resolve.
In the Detect mode, AI agents sometimes stop after finding a single vulnerability instead of conducting a full audit. In the Patch mode, they currently struggle to close non-obvious issues while maintaining the full functionality of the contract.
“EVMbench does not reflect the full complexity of real smart contract security. While they are realistic and critical, many protocols undergo more rigorous audits and may be more challenging to exploit,” OpenAI emphasized.
It is worth noting that in November 2025, Microsoft introduced a testing environment for AI agents and identified vulnerabilities inherent in modern digital assistants.
