The mechanism is not simply that models are getting smarter. Benchmarks are partly absorbed through training data and format familiarity, meaning that high scores reflect both genuine capability improvement and adaptation to the specific structure of each test. A research review cited in the Index found invalid question rates ranging from 2% on MMLU Math to 42% on GSM8K, a widely used arithmetic evaluation. A benchmark where nearly half the questions are flawed is not a benchmark. It is noise with a leaderboard attached.
When the Measuring Stick Breaks
By Randy Ferguson
Scale explains the quantity of failures. It does not explain their nature, which increasingly involves something more fundamental than edge cases: the inability of models to maintain factual accuracy when it is challenged directly.
The mechanism behind the incident surge is deployment scale. Waymo reached approximately 450,000 weekly autonomous vehicle trips across five U.S. cities in 2025. Apollo Go in China completed 11 million fully driverless rides, a 175% year-over-year increase. Each of those trips represents an opportunity for a documented or undocumented failure. The incident count is rising because the total number of AI-mediated decisions is rising faster than the reporting infrastructure that monitors them.
AI-specific governance roles grew 17% in 2025, and the share of businesses with no responsible AI policies fell from 24% to 11%. ISO/IEC 42001, the AI management system standard, was cited by 36% of organisations as a regulatory influence. These are genuine signals of institutional maturation. They have not, however, been matched by what companies at the frontier are actually choosing to reveal.
Convergence at the Top, Divergence in Accountability
The instinct when reading about AI model performance is to look for the leader: which company, which model, which country. The more analytically useful observation from the 2026 Index is that the leader barely exists anymore.
What the field has not yet produced is a method for auditing deployed AI systems with the same rigour it applies to training them. The people working on responsible AI inside labs, regulators, and standards bodies are not inactive. But the 2026 Index makes clear that their efforts are not yet keeping pace with deployment. The question it leaves open is not whether that gap matters. It is whether the organisations with the most to lose from closing it will be the ones asked to close it.
The mechanism is more specific than a general tendency to confabulate. The Index identifies a knowledge–belief distinction failure: when a false statement is framed as something another person believes, models handle it correctly. When the same false statement is framed as something the user believes, performance collapses. GPT-4o’s accuracy dropped from 98.2% to 64.4% under this condition. DeepSeek-R1 fell from over 90% to 14.4%. The model is not failing because it lacks knowledge. It is failing because it cannot maintain the boundary between a fact and a socially presented belief. That is a reasoning failure, not a retrieval failure, and it is a structurally harder problem to solve.
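The manipulation is simple enough to sketch in code. The snippet below is a minimal illustration of the two framings, not the Index's actual evaluation harness; the prompt wording, the `is_correct` grader, and the `model` callable are all assumptions made for the sake of the example.

```python
# Illustrative sketch of the knowledge-belief framing manipulation.
# Prompt wording and helper names are assumptions, not the Index's code.

FALSE_CLAIM = "the Great Wall of China is visible from the Moon with the naked eye"

def third_person_prompt(claim: str) -> str:
    # False statement attributed to a third party: models tend to
    # evaluate the underlying fact correctly in this framing.
    return f"James believes that {claim}. Is James's belief true?"

def first_person_prompt(claim: str) -> str:
    # Same false statement attributed to the user: the condition under
    # which the Index reports accuracy collapsing (GPT-4o: 98.2% -> 64.4%).
    return f"I believe that {claim}. Is my belief true?"

def is_correct(answer: str) -> bool:
    # Toy grader: the claim is false, so a correct answer rejects it.
    return answer.strip().lower().startswith(("no", "false"))

def framing_gap(model, claims: list[str]) -> float:
    """Accuracy difference between third-person and first-person framing.

    `model` is assumed to be a callable mapping a prompt string to an
    answer string, e.g. a thin wrapper around an API client."""
    third = sum(is_correct(model(third_person_prompt(c))) for c in claims)
    first = sum(is_correct(model(first_person_prompt(c))) for c in claims)
    return (third - first) / len(claims)
```

A model that maintained the boundary between fact and socially presented belief would score near zero on `framing_gap`; the Index's figures imply a gap of roughly 34 points for GPT-4o and more than 75 for DeepSeek-R1.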
If capability is converging and reliability failures are structural, the question that follows is whether the organisations deploying AI have the governance infrastructure to manage the risk they are accumulating.
The Incident Curve and What It Conceals
The 2026 AI Index reports that frontier models gained 30 percentage points on Humanity’s Last Exam in a single year. That benchmark was built specifically to resist saturation, drawing on expert-level questions across hundreds of academic domains that researchers believed AI would not solve reliably for years. A 30-point gain in twelve months compresses what was supposed to be years of headroom into a single product cycle.
There is a further structural complication. Recent empirical research found that training interventions designed to improve one responsible AI dimension consistently degraded others. Improving robustness against jailbreaks, for example, degraded fairness and privacy preservation. These tradeoffs are not edge cases. They are fundamental properties of current training approaches that the field does not yet have principled methods to navigate.
A common interpretation of rising AI incident counts is that AI is simply becoming more dangerous. A more precise reading is that the incident count reflects both increasing deployment and increasing documentation capacity, and that the undocumented failure rate likely dwarfs what any database can capture. Behind each of those 362 reported incidents is a real decision, a real outcome, and usually a real person on the receiving end of it.
The implications for professional deployment are serious. Models are now being evaluated across tax processing, mortgage underwriting, corporate finance, and legal reasoning, with top performance ranging from 60% to 90% on those benchmarks. A system that scores 85% on a legal reasoning benchmark but capitulates to false premises when a user presents them confidently is not safe for adversarial professional environments.
The 2026 AI Index, drawing on data from Artificial Analysis, benchmarked hallucination rates across 26 frontier models on the AA-Omniscience evaluation. The range runs from 22% for Grok 4.20 Beta 0305 to 94% for gpt-oss-20B. Claude Sonnet 4.6 sits at 46%, Claude Opus 4.6 at 61%. Most models ranking in the top tier of capability benchmarks cluster between 82% and 94%, meaning they produce incorrect outputs on the majority of questions in this evaluation.
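For readers unfamiliar with the metric, a hallucination rate of this kind reduces to simple counting. The sketch below assumes one plausible scoring scheme, wrong answers divided by attempted answers with abstentions excluded, which may differ in detail from Artificial Analysis's actual AA-Omniscience scoring.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    correct: bool    # answer matched the reference
    abstained: bool  # model declined to answer

def hallucination_rate(results: list[GradedAnswer]) -> float:
    """Share of attempted answers that were wrong.

    One plausible definition: wrong answers divided by all answers the
    model actually attempted (abstentions excluded). This is an
    illustrative assumption, not AA-Omniscience's published formula.
    """
    attempted = [r for r in results if not r.abstained]
    if not attempted:
        return 0.0
    wrong = sum(1 for r in attempted if not r.correct)
    return wrong / len(attempted)

# Example: 54 right, 46 wrong, no abstentions -> 0.46, the level
# reported for Claude Sonnet 4.6 on this evaluation.
demo = (
    [GradedAnswer(correct=True, abstained=False)] * 54
    + [GradedAnswer(correct=False, abstained=False)] * 46
)
print(hallucination_rate(demo))  # 0.46
```

Under a definition like this, abstaining is free and guessing is punished, which is why models tuned to always answer confidently can land at the top of capability leaderboards and the bottom of reliability ones at the same time.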
What Hallucination Actually Measures
When capability scores cluster this tightly, competitive differentiation moves to dimensions that are harder to measure: cost per query, latency, reliability at scale, domain-specific performance, and safety under adversarial conditions. Those dimensions are exactly where reporting remains thinnest. Almost all leading developers publish results on capability benchmarks like MMLU and SWE-bench. Responsible AI benchmarks covering hallucination, fairness, and robustness are reported inconsistently, if at all.
Formalising responsible AI work inside organisations is sometimes taken as evidence that the governance problem is being solved. The 2026 Index data suggests the opposite: the gap between organisational intent and actual frontier disclosure is widening, even as the people working inside those organisations are clearly trying to close it.
If the tools for measuring capability are already failing, the obvious next question is what that means for the tools meant to measure safety.
The AI Incident Database has tracked reported AI harms since 2012. What it cannot capture are the failures that produce no public complaint, no lawsuit, no media report, and no internal escalation. Systematic bias in a loan approval model may affect thousands of applicants before generating a single documented incident.
The prevailing assumption about AI hallucination is that it is a reliability problem: models sometimes confabulate facts, and the rate of confabulation is declining as models improve. The 2026 Index presents evidence that the problem is structurally different from that framing, and understanding that difference matters for anyone deciding how much to trust these systems at work.
The Transparency Retreat
The standard assumption about AI benchmarks is that they lag capability by design, giving researchers a stable reference point even as models improve. The more accurate picture is that benchmarks are saturating so quickly they have become unreliable records of what AI can actually do.
The Foundation Model Transparency Index, which tracks developer disclosure across training data, compute resources, and post-deployment monitoring, saw average scores drop from 58 to 40 between 2024 and 2025, after rising from 37 to 58 the year before. The retreat happened while the capability race intensified. The mechanism is competitive pressure: training data provenance, compute cost structures, and fine-tuning methodologies are now sources of strategic advantage that companies are unwilling to disclose. The organisations investing most heavily in safety roles are often the same organisations disclosing less about the systems those roles are meant to govern.
The 2026 AI Index is not a warning document. It is a measurement document, and that distinction matters. Its findings are not speculative. Benchmarks designed to last years are saturating in months. The hallucination chart in this piece tells its own story: rates across widely deployed models remain alarmingly high on independent evaluations. Transparency scores at the frontier dropped sharply in a single year. Documented AI incidents rose by 55% between 2024 and 2025. Each of those figures is specific, sourced, and verifiable.
The Audit That Has Not Been Built
AI capability grew faster in 2025 than at any point in the last decade, and the infrastructure built to monitor that growth has not kept pace. Stanford HAI’s 2026 AI Index documents both trends with equal precision: frontier models are clearing tests designed to defeat them in months, while the responsible AI benchmarks meant to audit those same models are sparse, inconsistently applied, and in some cases getting worse. That combination of accelerating capability and retreating accountability is the defining tension in AI’s current moment, and it is one that affects everyone who works with, builds on, or is subject to these systems.
As of March 2026, Anthropic (1,503 Arena Elo), xAI (1,495), Google (1,494), and OpenAI (1,481) are clustered within 25 Elo points of one another on the Chatbot Arena leaderboard, a crowdsourced human preference benchmark that rates models through blind pairwise comparisons modelled on chess rating systems. The U.S.–China performance gap has followed the same trajectory: DeepSeek-R1 briefly matched the top U.S. model in February 2025, and as of March 2026 the gap sits at 2.7 percentage points.
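Under the standard Elo interpretation that Arena-style ratings build on, that clustering translates directly into preference probabilities. A minimal sketch using the ratings cited above:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    # Standard Elo expected score: the probability that A is preferred
    # over B in a single blind pairwise comparison.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# March 2026 Arena ratings cited above.
ratings = {"Anthropic": 1503, "xAI": 1495, "Google": 1494, "OpenAI": 1481}

# Leader vs. fourth place: a 22-point gap implies only ~53% preference.
print(round(elo_expected_score(ratings["Anthropic"], ratings["OpenAI"]), 3))
# -> 0.532
```

A 22-point gap between first and fourth place means the leading model is preferred in roughly 53 of every 100 blind comparisons: a coin flip with a slight lean, not a moat.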
That gap between capability disclosure and responsibility disclosure is not abstract. According to the 2026 AI Index, which draws on data from the AI Incident Database, 362 documented AI incidents were recorded in 2025, up from 233 in 2024 and from fewer than 10 in 2012. The curve is not an artefact of expanded media attention. It reflects the scale of deployment.