OpenAI's First Proof — When AI Takes on Research-Level Mathematics

OpenAI entered the First Proof challenge — the toughest math test yet for AI — and submitted proofs for 10 research-level problems. At least 5 are believed correct. The results are impressive, nuanced, and a sign of where AI reasoning is heading.
The Hardest Math Test AI Has Ever Faced
On February 5, 2026, eleven of the world's leading mathematicians — from Harvard, Stanford, Yale, MIT, EPFL, and UC Berkeley — set 10 "lemmas" for AI systems to solve in one week. Unlike math olympiads or multiple-choice benchmarks, First Proof demands something qualitatively harder: complete, end-to-end mathematical proofs in specialized research domains, each requiring expert review to verify. Several of the problems had remained unsolved by anyone for years before the authors themselves found solutions.
OpenAI submitted its proof attempts on Valentine's Day, February 14, 2026. Based on feedback from the mathematical community, at least five of the model's attempts — problems 4, 5, 6, 9, and 10 — are believed to have a high likelihood of being correct. That's not a perfect score. But it's a result that would have seemed implausible just two years ago.
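To make the headline number concrete, here is a minimal tally of the figures quoted above. It is only a sketch: the names ALL_PROBLEMS, LIKELY_CORRECT, and summarize are made up for illustration, and it assumes the problems are numbered 1 through 10. The remaining problems are labeled "not yet confirmed" rather than "wrong," because the community feedback sets a lower bound ("at least five") and not a final grade.

```python
# Problem numbers in the First Proof set (assumed to be numbered 1 through 10).
ALL_PROBLEMS = set(range(1, 11))

# Attempts the community believes have a high likelihood of being correct,
# as listed in the article. This is a lower bound, not a final grade.
LIKELY_CORRECT = {4, 5, 6, 9, 10}


def summarize(all_problems, likely_correct):
    """Return (count likely correct, fraction of the set, unconfirmed problem numbers)."""
    unconfirmed = sorted(all_problems - likely_correct)
    fraction = len(likely_correct) / len(all_problems)
    return len(likely_correct), fraction, unconfirmed


count, fraction, unconfirmed = summarize(ALL_PROBLEMS, LIKELY_CORRECT)
print(f"Likely correct: {count}/{len(ALL_PROBLEMS)} ({fraction:.0%})")
print(f"Not yet confirmed: {unconfirmed}")
```

If more attempts are later judged correct, only LIKELY_CORRECT needs to change.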
What Made This Different
Most AI math benchmarks test whether a model can arrive at the right answer. First Proof tests something harder: whether a model can construct a complete, rigorous argument that a human mathematician would accept as a proof. The problems were written by active researchers, many of whom are world experts in their respective fields. Verifying correctness is itself a non-trivial task — the First Proof organizers noted it would take a strong academic department roughly a week just to check the submissions.
For certain problems, OpenAI selected the strongest attempt from multiple model iterations based on human judgment, and facilitated back-and-forth between the model and ChatGPT to assist with verification, formatting, and stylistic refinements. OpenAI acknowledged this process was rapid and not as methodical as a fully controlled evaluation — a candid admission that reflects both the compressed timeline and the genuine difficulty of the challenge.
A Trail of Milestones Leading Here
- July 2025: OpenAI's general-purpose reasoning model achieved gold medal-level performance at the International Mathematical Olympiad, scoring 35/42 points — matching Google DeepMind's model.
- November 2025: OpenAI published early experiments where GPT-5 helped researchers make concrete progress across math, physics, biology, and other fields, alongside honest assessments of limitations.
- Early 2026: GPT-5.2 proposed a candidate expression for a gluon-amplitude formula in physics — which was then formally proved by an internal model and verified by the authors.
- February 2026: First Proof submissions — the most direct test yet of research-grade mathematical reasoning.
The Benchmarks Behind the Claim
On GPQA Diamond, a graduate-level benchmark of physics, chemistry, and biology questions designed to be "Google-proof," GPT-5.2 Pro achieves 93.2%, with GPT-5.2 Thinking close behind at 92.4%. On FrontierMath (Tier 1–3), a private benchmark of expert-level mathematics, GPT-5.2 Thinking set a new state of the art by solving 40.3% of problems. FrontierMath's problems are designed to challenge even expert mathematicians.
The Honest Verdict: Impressive, But Not a Replacement
Scientific American's assessment captured the nuance well: none of the AI systems came close to solving all 10 problems, and "the verdict is in — AI is not about to replace mathematicians." Questions about human involvement also arose, with several community submissions appearing to involve varying degrees of human input, which First Proof's rules formally prohibit. "Once there's humans involved, how do we judge how much is human and how much is AI?" asked Lauren Williams of Harvard, one of the challenge's organizers.
OpenAI's own framing is consistent: the goal isn't for AI to replace mathematical creativity, but to accelerate it. Producing likely-correct proofs for at least 5 of 10 research-level problems, even with some human-in-the-loop assistance, is not the endgame. It's a data point in an ongoing experiment about where the ceiling actually is.
What's Next: First Proof Round 2
The First Proof organizers have already announced a second batch, designed as a more rigorous formal benchmark running from March to June 2026. The rules will be stricter: one shot per question, no additional interaction, blind peer review by human mathematicians, and standardized grading criteria. It will also include a public community experimentation round. OpenAI has expressed eagerness to collaborate on a more controlled evaluation framework. The next round will tell us whether February's results were a ceiling — or a floor.