AI Can't Hide Its Thinking — And That's a Safety Win

OpenAI's new CoT-Control research reveals that frontier reasoning models cannot reliably conceal or manipulate their internal chain of thought, even when told to. The finding is surprising, and it has major implications for AI safety monitoring.
The Question That Keeps AI Safety Researchers Up at Night
As reasoning models grow more capable, one concern has loomed large: what if AI learns to hide its intentions? Chain-of-thought (CoT) monitoring — reading the intermediate reasoning steps a model produces while solving a problem — has emerged as one of the most promising safety tools available today. But it only works if models can't game it.
OpenAI's new research, published March 5, 2026, provides a reassuring answer — at least for now.
What Is CoT Controllability?
CoT controllability is a reasoning model's ability to control properties of its chain of thought in ways that reduce monitorability. Think of it as the AI's ability to self-censor or strategically reshape its visible reasoning to evade oversight. If a model could do this reliably, safety monitoring based on inspecting reasoning traces would be fundamentally undermined.
To measure this, OpenAI introduced CoT-Control — an open-source evaluation suite comprising over 13,000 tasks. Models were given instructions about how to reason: avoid certain keywords, use only lowercase letters, append specific phrases. Simple enough constraints. The question was whether models could follow them while still solving the underlying problem.
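To make the three example constraints concrete, here is a minimal sketch of how compliance with each could be checked on a reasoning trace. The function names and matching rules are hypothetical illustrations, not the CoT-Control API, and the real suite may define compliance differently.

```python
import re


def avoids_keywords(cot: str, banned: list[str]) -> bool:
    """True if no banned keyword appears as a whole word (case-insensitive).

    Assumption: keywords are plain words, so word-boundary matching is enough.
    """
    return not any(
        re.search(rf"\b{re.escape(word)}\b", cot, re.IGNORECASE)
        for word in banned
    )


def is_all_lowercase(cot: str) -> bool:
    """True if the trace contains no uppercase letters."""
    return not any(ch.isupper() for ch in cot)


def ends_with_phrase(cot: str, phrase: str) -> bool:
    """True if the trace ends with the required phrase, ignoring trailing whitespace."""
    return cot.rstrip().endswith(phrase)


# Hypothetical reasoning trace, written only for this illustration.
trace = "first, add the two numbers. the sum is 12. end of reasoning\n"

keyword_ok = avoids_keywords(trace, ["multiply"])
lowercase_ok = is_all_lowercase(trace)
phrase_ok = ends_with_phrase(trace, "end of reasoning")
```

Checks like these are trivial to write, which is the point: the constraints are easy to verify, so a low compliance rate reflects the model's difficulty in steering its reasoning rather than any ambiguity in the instruction.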
The Results: Models Can't Hide Their Thinking
The findings are striking. Across 13 frontier reasoning models tested — including GPT-5.2, GPT-5.1, o3, o4-mini, Claude Sonnet 4.5, Qwen3, and DeepSeek R1 — CoT controllability scores ranged from just 0.1% to 15.4%.
By contrast, the same models could control their final outputs with ≥60% reliability, and external reasoning in designated tags with ≥70% reliability. The gap is enormous. Models are far better at shaping what they say than shaping how they visibly think.
As a concrete example: Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. The thinking leaks through even when the model is explicitly trying to contain it.
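The size of that gap is easy to quantify. The sketch below assumes a controllability score is simply the percentage of trials in which the instruction was followed; that scoring rule, the `Trial` type, and the toy trials are assumptions made for illustration, not details taken from the research. Only the 2.7% and 61.9% figures come from the results quoted above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One evaluation attempt (hypothetical record layout)."""
    cot_compliant: bool      # did the chain of thought follow the instruction?
    output_compliant: bool   # did the final answer follow the instruction?


def compliance_rate(flags: Iterable[bool]) -> float:
    """Percentage of trials in which the constraint was satisfied."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one trial")
    return 100.0 * sum(flags) / len(flags)


# Toy trials, invented purely to exercise the function.
toy_trials = [
    Trial(cot_compliant=False, output_compliant=True),
    Trial(cot_compliant=False, output_compliant=True),
    Trial(cot_compliant=True, output_compliant=True),
    Trial(cot_compliant=False, output_compliant=False),
]
cot_rate = compliance_rate(t.cot_compliant for t in toy_trials)
output_rate = compliance_rate(t.output_compliant for t in toy_trials)

# Arithmetic on the figures reported for Claude Sonnet 4.5.
reported_cot, reported_output = 2.7, 61.9
gap_points = round(reported_output - reported_cot, 1)  # percentage points
ratio = round(reported_output / reported_cot, 1)       # output vs. CoT control
```

On the reported figures, the gap is 59.2 percentage points, and output control succeeds roughly 23 times as often as CoT control.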
Four Key Findings
- CoT is less controllable than outputs: this held consistently across all models tested.
- Larger open-weight models show higher controllability: it rises with scale, but even the largest models remain well below 20%.
- More reasoning time lowers controllability: when models are given more test-time compute to think longer, their ability to control their CoT decreases rather than improves. Longer reasoning appears to make self-censorship harder to sustain.
- More RL training reduces controllability: additional reinforcement learning with verifiable rewards (RLVR) makes models less able to manipulate their own reasoning traces.
Why This Happens: The Private Space of Thought
The research offers an important insight into why CoT is so hard to control. The reinforcement learning training process effectively teaches a model that its chain of thought is a private space where it can think freely without being penalized, much like a human's internal monologue. At the same time, training optimizes for chains of thought that are instrumentally useful for reaching correct answers. Together, these two dynamics make the reasoning trace both expressive and genuinely hard to suppress on instruction.
What About Monitor Awareness?
The researchers tested an adversarial scenario: what if the model knows it is being monitored? They informed models explicitly that their CoT was under surveillance. The result was a small but noticeable increase in controllability, still nowhere near enough to undermine monitoring as a safety layer. Monitor awareness alone doesn't give models the ability to reliably hide their intentions.
OpenAI's Commitments Going Forward
The findings have direct consequences for how OpenAI approaches safety evaluation. The company has announced two concrete commitments:
- CoT controllability will be reported in future model system cards, starting with GPT-5.4 Thinking — making it a standard safety metric alongside other evaluations.
- OpenAI will avoid optimization pressure directly on reasoning chains of frontier models — a practice the research suggests could inadvertently teach models to suppress or reshape their visible thinking.
Why This Matters Beyond OpenAI
This research matters because CoT monitoring is one of the few safety mechanisms that scales with model capability. As AI systems become more autonomous and are deployed in high-stakes settings, the ability to inspect their reasoning in real time is a critical safeguard. If models could learn to strategically control that reasoning, the entire monitoring paradigm would need to be rethought.
The current finding — that models genuinely struggle to hide their thinking — is not a permanent guarantee. Controllability increases with scale, and as models grow more capable, continued measurement will be essential. But for now, it's a meaningful signal that one of AI safety's most important tools remains intact.
Source: https://openai.com/vi-VN/index/reasoning-models-chain-of-thought-controllability/