Summary

  • Five leading AI models disagreed on 67% of 1,000 real-world fact-check claims.
  • Consensus was reached on merely 328 claims.
  • With a Krippendorff's alpha of 0.639, the models fall short of the 0.8 reliability benchmark.

When five of the most sophisticated AI systems were asked to judge the veracity of the same statements, at least one model gave a conflicting answer in roughly two-thirds of cases. The finding comes from a recent study by researcher Kosta Jordanov at Lenz Research.

The investigation involved the AI models GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro, which were tasked with evaluating the same set of 1,000 fact-check claims submitted by users. Each model was required to categorize the claims as true, mostly true, misleading, or false.

Out of the 1,000 claims analyzed, at least one model differed from the majority in 672 instances. Notably, in 34% of those disagreements, one model labeled a claim as true while another deemed it false.
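The headline counts fit together: 328 unanimous claims plus 672 split claims account for all 1,000. As a rough illustration of how such tallies could be produced, here is a minimal Python sketch. The verdict rows are hypothetical toy values, not data from the study, and the function name summarize is an assumption made for this example.

```python
# Hypothetical toy verdicts: one row per claim, one entry per model.
# These values are illustrative only and do not come from the study.
verdicts = [
    ["true", "true", "true", "true", "true"],
    ["false", "false", "false", "false", "false"],
    ["true", "mostly true", "misleading", "false", "true"],
    ["misleading", "misleading", "false", "misleading", "misleading"],
    ["mostly true", "true", "true", "true", "mostly true"],
]


def summarize(rows):
    """Count unanimous claims, split claims, and direct true-vs-false conflicts."""
    # A claim is unanimous when every model returned the same label.
    unanimous = sum(1 for row in rows if len(set(row)) == 1)
    # Every other claim has at least one model out of step with the rest.
    split = len(rows) - unanimous
    # The sharpest conflict: one model says "true" while another says "false".
    true_vs_false = sum(1 for row in rows if "true" in row and "false" in row)
    return {
        "claims": len(rows),
        "unanimous": unanimous,
        "split": split,
        "true_vs_false": true_vs_false,
    }


summary = summarize(verdicts)
print(summary)
```

On the toy rows, two claims are unanimous, three are split, and one of the split claims contains a direct true-versus-false conflict.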

The study highlights that these claims were not standard benchmark items with public answer keys; they were real submissions intended for verification on a fact-checking platform. “Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model’s verdict is label-inconsistent under this 4-bucket rubric,” the research states.

Previous investigations into AI hallucinations have shown that chatbots can fabricate information. This study identifies a different problem: the models are not necessarily inventing false information, but they struggle to reach consensus on basic factual assessments of the same claims.

The methodology used in this study makes it difficult for AI developers to dismiss the results. Rather than relying on standard test sets—often appearing in training data—the researchers utilized claims sourced from actual users on Lenz’s fact-checking platform. “Most of these claims are unlikely to appear in any training corpus with a gold label attached—there’s no canonical answer key to pattern-match against, no benchmark leaderboard to anchor to,” the paper emphasizes.

The study used Krippendorff’s alpha to measure agreement and reported a score of 0.639, on a scale where 1.0 indicates perfect agreement and 0 indicates agreement no better than chance. The paper describes this as “nontrivial but limited agreement.” According to the researchers, the models’ verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. By convention, an alpha of at least 0.8 is the threshold for reliable agreement, so 0.639 falls short of it.
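For readers unfamiliar with the statistic, the following is a minimal Python sketch of Krippendorff's alpha for nominal labels with no missing ratings. It is an illustration under stated assumptions, not the study's code: the article does not say which variant of alpha the researchers used (a four-bucket scale could also be scored as ordinal), the function name krippendorff_alpha_nominal is made up for this example, and the panel rows are toy values.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of claims; each claim is a list of labels, one per rater.
    Assumption: every disagreement counts equally (nominal metric).
    """
    # Coincidence matrix: each ordered pair of ratings from different raters
    # on the same claim contributes 1 / (m - 1), where m is the rater count.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a claim with a single rating cannot be paired
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (mismatching label pairs only).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical five-model panel on four claims (toy values, not study data).
panel = [
    ["true", "true", "true", "true", "true"],
    ["false", "false", "false", "false", "false"],
    ["true", "mostly true", "misleading", "false", "true"],
    ["misleading", "misleading", "false", "misleading", "misleading"],
]
alpha = krippendorff_alpha_nominal(panel)
print(round(alpha, 3))
```

A panel that agrees on every claim scores exactly 1.0. Each split claim pulls the score down in proportion to how many rater pairs disagree, compared with the disagreement expected if the same labels were shuffled across claims.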

When all five models did reach an agreement—occurring on only 328 out of 1,000 claims—the consensus rarely classified any claim as misleading or mostly true. Just four claims were unanimously deemed “misleading,” and none were labeled “mostly true.”

The researchers provided examples of claims on which the AI models diverged sharply. For the statement "The World Bank's active portfolio in Nigeria stands at over $16.4 billion as of 2025," GPT-5.4 rated it "mostly true," Gemini 3 Pro deemed it "false," and its counterpart Gemini 3 Pro + Search classified it as "misleading."

In another case, the claim, "Donald Trump said that an attack on Iran was postponed at the request of Gulf Allies," received varied evaluations: GPT-5.4 rated it false, Claude Opus 4.7 labeled it mostly true, while Gemini 3 Pro marked it as false, and Gemini 3 Pro + Search rated it true.

The researchers concluded that the models converge on clear-cut verdicts but that their consensus fractures in the middle ground. Unanimity is typically found only at the extremes, where a claim is either definitively true or definitively false.

This issue is particularly significant as more people rely on AI systems for fact-checking. If you enter a claim from a news source into ChatGPT, Claude, and Gemini, you may receive three different answers. Which one should you trust?

AI companies frequently assert that their models are improving in accuracy, often presenting benchmark scores that reflect gradual enhancement. However, the Lenz study evaluated these models against the complex and ambiguous claims that real people actually debate—and found that the models themselves often disagree.

The paper stresses: “A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.”

The findings point to a deeper issue. Under the study’s rubric, only one verdict can be correct per claim, so when models disagree, at least one verdict must be what the study calls “label-inconsistent under this 4-bucket rubric.” There is no tie-breaker mechanism or appeals process. Recent reports on AI reliability have raised similar concerns.

Among the 328 claims where all five models concurred, none was unanimously labeled "mostly true," which suggests the panel rarely agrees on nuanced verdicts. If AI models can only achieve consensus at the extremes, can they be relied upon as effective fact-checkers?
