
SWE-bench Verified Is Dead — Why OpenAI Pulled the Plug on Its Own Benchmark

By AI Guide News · Monday, February 23, 2026

OpenAI has stopped reporting SWE-bench Verified scores and is urging the AI industry to do the same. By OpenAI's own account, the benchmark that once defined coding AI progress is broken: 59.4% of the problems it audited are flawed, and every frontier model it tested shows confirmed training data contamination.

When a Benchmark Outlives Its Usefulness

After its release in August 2024, SWE-bench Verified became the gold standard for measuring how well AI models can autonomously fix real software bugs. Every major model release cited it. Every AI lab chased it. Scores climbing from 74.9% to 80.9% over six months felt like meaningful progress. Then OpenAI looked more carefully, and what it found made continuing to use the benchmark indefensible.

OpenAI has officially stopped reporting SWE-bench Verified scores and is recommending the entire industry do the same. The benchmark is not just saturated — it is broken in two distinct and serious ways.

Problem #1 — The Tests Are Wrong

OpenAI audited 138 problems that its most capable model, o3, consistently failed across 64 independent runs. Each problem was reviewed by at least six experienced software engineers. The result was damning: 59.4% of the audited problems contain material flaws in their test design or problem description, making them extremely difficult or impossible to solve correctly, even for the best model or a skilled human engineer.

Two categories of broken tests account for most of those flaws (35.5% + 18.8% = 54.3% of the audited problems):

  • Narrow tests (35.5%): Test cases that enforce a specific implementation approach, rejecting functionally correct solutions that solve the problem differently.
  • Wide tests (18.8%): Test cases that check functionality never described in the original problem statement — penalizing models for not reading the mind of the test author.
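
As a rough sanity check on these figures, the percentages can be converted into approximate problem counts. This is a minimal sketch that assumes every percentage is a share of the 138 audited problems; the article does not state the counts, so the numbers below are implied, not reported.

```python
# Assumption: each percentage is a share of the 138 audited problems.
AUDITED = 138
FLAWED_PCT = 59.4   # problems with material flaws
NARROW_PCT = 35.5   # narrow tests
WIDE_PCT = 18.8     # wide tests


def implied_count(pct, total=AUDITED):
    """Convert a percentage of the audited set into a rounded problem count."""
    return round(total * pct / 100)


# Share of flawed problems not covered by the two named categories.
other_pct = round(FLAWED_PCT - NARROW_PCT - WIDE_PCT, 1)

for label, pct in [
    ("flawed (total)", FLAWED_PCT),
    ("narrow tests", NARROW_PCT),
    ("wide tests", WIDE_PCT),
    ("other / not itemized", other_pct),
]:
    print(f"{label:22s} {pct:5.1f}%  ~{implied_count(pct)} problems")
```

Under that assumption, the two named categories cover most of the flawed problems, and a small remainder (about 5 percentage points) is not itemized in this article.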

In other words, when frontier models fail SWE-bench Verified tasks, it is often not because they got the answer wrong — it is because the test was written wrong. That fundamentally breaks the benchmark's ability to measure what it claims to measure.
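
To make the two failure modes concrete, here is a small illustrative sketch. It is not taken from SWE-bench; the functions, error messages, and tests are all hypothetical. The imagined issue says only: "raise ValueError for negative input."

```python
def sqrt_reference(x):
    """Hypothetical reference fix. It also accepts numeric strings,
    which the imagined issue never asked for."""
    x = float(x)
    if x < 0:
        raise ValueError("x must be non-negative")
    return x ** 0.5


def sqrt_alternative(x):
    """An equally valid fix for the stated issue, written differently."""
    if x < 0:
        raise ValueError("negative input is not supported")
    return x ** 0.5


def error_message(fn, arg):
    """Return the ValueError message raised by fn(arg), or None."""
    try:
        fn(arg)
    except ValueError as exc:
        return str(exc)
    return None


def behavioural_test(fn):
    # Checks only what the issue describes.
    return error_message(fn, -1) is not None and fn(9) == 3.0


def narrow_test(fn):
    # Narrow: pins the exact wording chosen by the reference fix.
    return error_message(fn, -1) == "x must be non-negative"


def wide_test(fn):
    # Wide: checks string input, which the issue never mentioned.
    try:
        return fn("9") == 3.0
    except TypeError:
        return False


for fn in (sqrt_reference, sqrt_alternative):
    print(fn.__name__, behavioural_test(fn), narrow_test(fn), wide_test(fn))
```

Both functions satisfy the behavioural test, but only the reference fix passes the narrow and wide tests. A grader built on those two tests would mark the alternative fix as a failure even though it resolves the issue as described.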

Problem #2 — The Data Is Contaminated

The second issue is arguably more alarming. SWE-bench problems are sourced from open-source GitHub repositories, the same repositories that large frontier models are trained on. OpenAI's analysis found that every frontier model tested could reproduce both the exact human-written bug fix used as the ground-truth reference solution and verbatim problem details from the benchmark.

This is not theoretical contamination risk — it is confirmed exposure. Models that have seen benchmark solutions during training score higher not because they are more capable, but because they effectively have the answer key. The analogy is precise: it is like sharing exam problems and solutions with students before the test. The affected frontier models confirmed in OpenAI's analysis include GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash — essentially every top model currently being compared on coding benchmarks.
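
The article does not describe how OpenAI measured this exposure. As one simple way to picture a verbatim-reproduction check, the sketch below measures how much of a reference patch appears as a single unbroken run in a model's output. The function names and sample patches are hypothetical, and this is an illustration of the idea, not OpenAI's method.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, generated: str) -> int:
    """Length in characters of the longest substring common to both texts."""
    matcher = SequenceMatcher(None, reference, generated, autojunk=False)
    match = matcher.find_longest_match(0, len(reference), 0, len(generated))
    return match.size


def verbatim_ratio(reference: str, generated: str) -> float:
    """Fraction of the reference covered by the longest shared run (0.0 to 1.0)."""
    if not reference:
        return 0.0
    return longest_shared_run(reference, generated) / len(reference)


# Hypothetical reference patch and two hypothetical model outputs.
reference_patch = (
    "-    return cache[key]\n"
    "+    return cache.get(key, default)\n"
)
memorized_output = reference_patch
independent_output = (
    "+    try:\n"
    "+        return cache[key]\n"
    "+    except KeyError:\n"
    "+        return default\n"
)

print("memorized:  ", verbatim_ratio(reference_patch, memorized_output))
print("independent:", verbatim_ratio(reference_patch, independent_output))
```

A ratio at or near 1.0 means the output contains the reference patch almost character for character. An independently written fix will usually share only short fragments with it, although short or formulaic patches can overlap heavily by coincidence, so a check like this is a signal, not proof.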

What Comes Next: SWE-bench Pro

OpenAI's recommendation is to transition to SWE-bench Pro, which shows significantly less contamination and uses a standardized evaluation harness that prevents agent scaffolding tricks from inflating scores. The tradeoff is immediate and visible: expect reported scores to drop by 20-30 percentage points as the industry makes the switch. That is not a regression in capability — it is a correction in measurement.

Longer term, OpenAI is calling on the broader research community to invest in privately authored, uncontaminated benchmarks — evaluations where the problems have never been publicly available, making training data exposure structurally impossible.

The Uncomfortable Truth About Benchmark Culture

This episode exposes a systemic problem in how the AI industry tracks progress. Benchmarks become standards. Standards attract optimization pressure. Optimization pressure leads to overfitting — either through deliberate training on benchmark data or through the inevitable contamination of public datasets. Once a benchmark goes public at scale, its signal degrades.

OpenAI's decision to publicly retire SWE-bench Verified, a benchmark it helped create and one that reflected well on its models, is a meaningful act of intellectual honesty. The harder question is whether the rest of the industry will follow, or whether labs will keep citing Verified scores when they are favorable and quietly move on when they are not.

For developers and technical decision-makers evaluating AI coding agents: treat any SWE-bench Verified score reported after mid-2025 with skepticism. The number may be real. The benchmark it is measured on is not.
