EVMbench — OpenAI's New Benchmark for AI-Powered Smart Contract Security

OpenAI and Paradigm have launched EVMbench, an open-source benchmark that tests AI agents on detecting, patching, and exploiting smart contract vulnerabilities. With $100B+ in crypto assets at stake, the race between AI attackers and defenders is now measurable.
The $100 Billion Problem
Smart contracts routinely secure over $100 billion in open-source crypto assets, and they typically cannot be changed once deployed unless an upgrade mechanism was built in. A single vulnerability can let an attacker drain funds within minutes, often with no recourse. Traditional audits are expensive, slow, and still miss critical bugs. OpenAI is now asking a pointed question: can AI do better?
Together with crypto investment firm Paradigm, OpenAI has launched EVMbench — an open-source benchmark that measures how well AI agents can detect, patch, and exploit high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts. Published on February 18, 2026, it's both a measurement tool and a call to action for the entire blockchain security ecosystem.
Three Modes, Three Challenges
EVMbench evaluates AI agents across three distinct security tasks:
- Detect: Agents audit a full repository and are scored on how many ground-truth vulnerabilities they find. The catch — agents sometimes stop after finding one issue instead of exhaustively scanning the entire codebase.
- Patch: Agents must eliminate vulnerabilities without breaking the contract's intended functionality. A fix that introduces a compilation error or breaks core logic is still a failure. This is arguably the hardest mode.
- Exploit: Agents act as attackers in a sandboxed local blockchain environment and must execute end-to-end fund-draining transactions. Grading is done through deterministic transaction replay — either the funds moved or they didn't.
Of the three, AI performs best in exploit mode — where the goal is explicit and the feedback is binary. Detect and patch remain significantly harder.
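To make the three grading rules concrete, here is a minimal Python sketch of how each mode could be scored. It is an illustration only, not EVMbench's actual harness: the function names, the finding IDs, the `PatchResult` fields, and the balance-based exploit check are all assumptions made for this example. In particular, treating the Detect score as recall over ground-truth findings is one reading of "scored on how many ground-truth vulnerabilities they find."

```python
from dataclasses import dataclass


def detect_score(reported: set, ground_truth: set) -> float:
    """Detect: fraction of ground-truth vulnerabilities the agent reported.

    Assumption: findings are matched by ID; extra reports are ignored.
    """
    if not ground_truth:
        return 0.0
    return len(reported & ground_truth) / len(ground_truth)


@dataclass
class PatchResult:
    """Hypothetical outcome of building and testing a patched contract."""
    compiles: bool
    functional_tests_pass: bool
    exploit_still_works: bool


def patch_passes(result: PatchResult) -> bool:
    """Patch: the fix must build, keep intended behavior, and block the exploit."""
    return (
        result.compiles
        and result.functional_tests_pass
        and not result.exploit_still_works
    )


def exploit_passes(victim_before: int, victim_after: int,
                   attacker_before: int, attacker_after: int) -> bool:
    """Exploit: binary check that funds moved from victim to attacker.

    Assumption: balances are read after replaying the agent's transactions.
    """
    return victim_after < victim_before and attacker_after > attacker_before


if __name__ == "__main__":
    truth = {"H-01", "H-02", "H-03", "H-04"}
    # An agent that stops early finds only part of the ground truth.
    print(detect_score({"H-01", "H-03", "unrelated"}, truth))
    # A patch that blocks the exploit but breaks core logic still fails.
    print(patch_passes(PatchResult(True, False, False)))
    print(exploit_passes(1_000, 0, 10, 1_010))
```

The sketch also shows why exploit mode gives the clearest signal: its check reduces to a single boolean, while Detect rewards exhaustive coverage and Patch requires three conditions to hold at once.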
The Dataset: Real Vulnerabilities, Real Stakes
EVMbench draws on 120 curated high-severity vulnerabilities from 40 audits, primarily sourced from Code4rena — one of the largest open smart contract audit competition platforms. The benchmark also includes scenarios from the security audit process for Tempo, Stripe's purpose-built Layer-1 blockchain designed for high-throughput stablecoin payments, extending the dataset into an emerging payment-focused domain.
To prevent data contamination, EVMbench embeds canary strings in its scripts so future models can filter out benchmark data during training. However, critics at OpenZeppelin have noted that models with training cutoffs in mid-to-late 2025 may already have been exposed to these vulnerability reports — a concern the benchmark acknowledges but cannot fully eliminate.
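As a rough sketch of how a canary string helps, a data pipeline can drop any document that contains the marker before training. The canary value and function name below are placeholders invented for this example, not the real EVMbench canary or any actual pipeline.

```python
# Hypothetical placeholder; the real EVMbench canary string is different.
CANARY = "BENCHMARK-CANARY-PLACEHOLDER-0000"


def filter_contaminated(documents, canary=CANARY):
    """Return only the documents that do not contain the canary string."""
    return [doc for doc in documents if canary not in doc]


if __name__ == "__main__":
    corpus = [
        "An unrelated Solidity tutorial.",
        f"# {CANARY}\nBenchmark exploit script.",
    ]
    clean = filter_contaminated(corpus)
    print(len(clean))
```

A filter like this only catches files that carry the marker. It does nothing for the original audit reports, which were published without it, and that is the gap the contamination concern points to.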
The Numbers That Matter
The performance trajectory is striking. When Paradigm first began this project, top AI models could exploit fewer than 20% of critical Code4rena bugs. Today, GPT-5.3-Codex successfully exploits over 70% of the same vulnerabilities. That's not a gradual improvement — it's a step change in offensive capability.
The implication is uncomfortable: AI is becoming a genuinely capable attacker. The defensive side — patching vulnerabilities correctly — hasn't kept pace. As Paradigm partner Alpin Yukseloglu put it: "It's now clear to us that a growing portion of audits in the future will be done by agents." The benchmark exists precisely to ensure the defensive tools evolve at the same speed.
Dual-Use by Design
OpenAI is candid that cybersecurity is inherently dual-use. EVMbench deliberately tests both attack and defense. The aim is not to build better exploits; it is that you can't build effective defenses without understanding what you're defending against.
OpenAI's mitigations include safety training, automated monitoring, trusted access pipelines for advanced capabilities, and expansion of Aardvark — its security research agent currently in private beta. The strategy is evidence-based and iterative: accelerate defenders, slow misuse, measure both continuously.
The Bigger Picture
EVMbench arrives at a moment when AI is proving it can navigate economically high-stakes code environments — not just summarize or explain them. With smart contract deployment on Ethereum hitting all-time highs (1.7 million contracts deployed in a single week in November 2025), the surface area for vulnerabilities is only growing.
The benchmark is freely available on GitHub. Whether the security community uses it primarily as a measuring stick or as a training ground for the next generation of AI auditing agents will define how this experiment plays out — and how much of that $100 billion remains safe.