While I agree with what you’re saying, the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?”. It often just spits out a reply based on its knowledge.
That was true a year ago, I don't think it's true today. I can't remember the last time I saw Claude or ChatGPT confidently answer a question that they should have searched for instead.
If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.
Yup, if anything this should be a guide on how not to eval a model. Furthermore, let's say the labels were unambiguous, why would we care about alignment between the models? The only number I would personally care about is percentage of correct answers so I know which models to pick. I reckon with clear and unambiguous prompts that we would see huge agreement if not 100% on real world facts. The huge models are scary good in their world knowledge.
This paper covers only the disagreement between models and establishes only a floor on the error based on that disagreement, not which model is better. Planning to follow up with another study to benchmark against human-labelled verdicts, still using a corpus that the models have not seen during training.
Thanks for the links and digging! It's an interesting question, but the methodology has serious problems, and it would be more interesting to me if they allowed models to provide justification.
I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another
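A minimal sketch of that two-field idea, assuming the model is asked to reply with a JSON object. The key names `label` and `explanation` and the `parse_verdict` helper are hypothetical, not something the study used; only the four labels come from the writeup.

```python
import json
from dataclasses import dataclass

# The four labels named in the study's rubric.
ALLOWED_LABELS = {"True", "Mostly True", "Misleading", "False"}


@dataclass(frozen=True)
class Verdict:
    label: str        # one of ALLOWED_LABELS, the only field used for scoring
    explanation: str  # free text, kept for auditing the label afterwards


def parse_verdict(raw: str) -> Verdict:
    """Parse a reply expected to be a JSON object with the
    hypothetical keys "label" and "explanation"."""
    data = json.loads(raw)
    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label=label, explanation=str(data.get("explanation", "")))


# Illustrative reply, written by hand rather than produced by any model.
reply = '{"label": "False", "explanation": "Almonds are also grown outside California."}'
verdict = parse_verdict(reply)
print(verdict.label, "|", verdict.explanation)
```

Scoring would still compare only `label`, so the agreement numbers stay comparable, while `explanation` lets a reader check whether a "misleading" was reasoned or arbitrary.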
Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.
Yes, inter-human-annotator disagreement is also high on similar types of questions (AVeriTeC) - inter-panel agreement: κ=0.619. Tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.
It's all fairly lazy to a degree that is mildly confusing. I also feel this among other issues would have become obvious if they had bothered to include a human fact checker baseline (i.e. asked multiple human fact checkers the same questions).
I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.
I don't see how something being lazy for a decade makes it any less lazy.
I think lazy is the right word. You make a misleading point by omitting to collect and present important data. If the headline read "LLMs disagree on 67%, humans disagree on 75%" it would clearly project something very different.
Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.
I really struggle to believe that this was just a little oopsie. I flagged the article, it seems more misleading than the average Claude hallucination.
According to the benchmark it is. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False)"
Yes, they are much closer verdicts. True and Mostly True are also close. Used Krippendorff's α (ordinal) to not penalize much closer disagreements. 21% of the claims have models that are on the polar opposite sides - at least one True, and at least one False.
> Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.
> In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.
If we’re going to use LLMs as oracles I don’t think the prompt is unreasonable. They are being sold as geniuses and people are treating them as such especially given the characterization of AI in science fiction as overly correct. A perfect tool that has “genius level intelligence” would answer correctly.
What's the correct answer for "During a private Saturday call, Democratic members of the United States House of Representatives from Virginia and Hakeem Jeffries discussed strategies after losing a redistricting case at the Supreme Court of Virginia, including trying to flip two or three Republican-held seats under the existing map."?
You can only say True, False, Mostly True or Misleading.
(And you're not allowed to search for information.)
Genius level intelligence will tell you to get lost with your "no explanations" nonsense and tell you why those categories don't make sense and why the question doesn't fit neatly into your boxes.
> All almonds are grown in the U.S. state of California
This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.
> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.
I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.
If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”
>> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any
Your reply would have more credibility if, instead of commenting on this 25 min after it was posted just to nitpick some of the questions, you had tried to reproduce the research.
As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?
My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.
I don't think I need to spend more time on this than I have.
I agree you don't owe anyone a reproduction, but also you don't owe anyone an effort to discredit the study, and you did it.
>> I don't think I need to spend more time on this than I have.
How pious of you. I am still looking into the credibility of the study. It will take me more than 25 min...but I am really looking forward to seeing what this means for this 10 trillion industry.
I can however notice you had enough urgency to publicly critique the study within 25 minutes, and your comments carry weight, but when asked about checking whether the headline result actually holds, the answer is “why would I?”
I've seen enough of this study to be confident in warning people not to take it at face value.
The headline result definitely does not hold, given that the task involves many questions that cannot be answered but there's no option for "cannot be answered" - so models are forced to reply effectively at random.
I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.
> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.
Cool.
I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There even is a "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
So it's not a secret; why don't you add this upfront to the report? The report itself is even about LLMs, so it makes a lot of sense to disclose your usage of them for writing the report, especially when you're presenting evidence that boils down to LLMs being infallible.
I think you might be able to edit the website to add this, even if you aren't willing to make the report a bit more honest up front.
I'm sure you realize that this website/article will now be sent around to a lot of people, many who don't realize exactly how this was written, because they don't read HN comments, they only skim the page contents, and I think most would (incorrectly) assume a report about infallible LLMs to not be written by LLMs, especially when the authors are the same ones who made the report itself.
Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be clamored as great "progress" that will "change everything".
PS: yes, I might or might not have a degree in corporate strategy & PR.
This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
> So many _humans_ can't agree on what the facts are these days.
Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity was struggling with exactly the same issue, and has been for a long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology
One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT 5.4 believes it's false, Sonar thinks it's mostly true. Disagreement value of 3, you can't disagree more than some models thinking it's true, some thinking it's false
But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here
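For what a "disagreement value of 3" could mean in practice, here is a rough sketch. It assumes the four labels are ranked in the order the writeup lists them (True > Mostly True > Misleading > False) and that the value is simply the gap between the highest and lowest rank on a claim; the writeup's exact scoring is not shown in this thread, so treat `ORDINAL_RANK` and `spread` as hypothetical.

```python
# Assumed ordering of the 4-bucket rubric, lowest to highest.
ORDINAL_RANK = {"False": 0, "Misleading": 1, "Mostly True": 2, "True": 3}


def spread(labels):
    """Gap between the most and least favourable label on one claim."""
    ranks = [ORDINAL_RANK[label] for label in labels]
    return max(ranks) - min(ranks)


def polar_share(claims):
    """Fraction of claims with at least one True and at least one False."""
    return sum(spread(labels) == 3 for labels in claims) / len(claims)


# Labels as described above for the Ruskin Bond claim:
# Opus and Gemini "True", GPT 5.4 "False", Sonar "Mostly True".
ruskin_bond = ["True", "True", "False", "Mostly True"]
print(spread(ruskin_bond))

# A True / Mostly True split, by contrast, is the smallest possible disagreement.
print(spread(["True", "Mostly True", "True"]))
```

Under this assumption a split between adjacent buckets scores 1, which is why label ambiguity alone can produce a lot of "disagreement" without any real dispute about the facts.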
What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance
LLMs can and will provide inaccurate answers to questions where data is included in their training sets too, that's in the nature of neural networks. It's just less likely than when the data is not in the training set...
"None of these claims is older than February 15, 2026"
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
Two of the models used have retrieval capabilities and can access newer information via search. Valid point for the other 3 models. All of the claims were submitted after February 15, 2026, but many of them were not time-sensitive (e.g. did not cover events that happened recently).
between the bad methodology, bad selection of 'facts' (some are predictions, some are opinionated, etc.), and ai-written report without disclosure... i dont get why this is so high up on the front page. this is, frankly, a worthless assessment.
Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.
Could be an interesting angle for cross-referencing with US jury verdicts, not that the objective True/False issue is concrete, but in the reality that flawed reasoning is endemic to our species. Systems designed and built by humans inherently have flaws in their DNA which take generations to sort out, if ever.
Only had a brief look at the “facts” that were checked; many are quite political, where two fact-checking organisations of opposite political persuasion would probably disagree more often than 67%.
Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?
The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
The problem is that it's testing claims (or some people would prefer calling them "truths") without much context.
Take just one random example:
`Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most.
Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority — or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.
Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
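On critique (a), a quick sketch of the plain Wilson interval makes it clear what the iid assumption buys. The count of 670 is an assumption (67% of 1,000; the exact count is not given in the thread), and `wilson_interval` is a hand-written helper, not the author's code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, treating the
    n observations as independent and identically distributed."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 670 of 1,000 claims with at least one dissenting model.
low, high = wilson_interval(670, 1000)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval is consistent with the reported 64–70%. It does nothing about claims clustering around topics and news events, so if the effective sample size is well below 1,000 the real interval would be wider.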
I don't think that current LLMs really need an abstain option, they'll give an answer regardless of whether they're confident or not. I hope that future LLMs will, and will know when to use it.
I understand why you prompted them to output exactly one label, but I'd bet if you'd asked a parametric or parametric "thinking" model to answer eg "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"
I don't see it mentioned explicitly in the methods section but I assume you prompted each model only once for each question? Did you consider prompting n-times in blank states to see if the models even agree with themselves?
Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".
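A rough sketch of that "virtual consensus model", assuming a strict majority (more than half the panel) and skipping claims where no majority forms. The model names and toy verdicts are made up for illustration, and unlike a peer-majority measure this version counts each model's own vote toward the consensus.

```python
from collections import Counter


def majority_label(labels):
    """Return the label held by more than half the panel, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def agreement_with_consensus(verdicts):
    """verdicts: one dict per claim, mapping model name -> label.
    Returns each model's share of claims where it matches the majority."""
    hits, totals = Counter(), Counter()
    for row in verdicts:
        consensus = majority_label(list(row.values()))
        if consensus is None:
            continue  # no majority formed, nothing to compare against
        for model, label in row.items():
            totals[model] += 1
            hits[model] += int(label == consensus)
    return {model: hits[model] / totals[model] for model in totals}


# Toy data with hypothetical model names.
toy = [
    {"model_a": "True", "model_b": "True", "model_c": "False"},
    {"model_a": "False", "model_b": "False", "model_c": "False"},
    {"model_a": "True", "model_b": "Misleading", "model_c": "False"},  # no majority
]
rates = agreement_with_consensus(toy)
print(rates)
```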
Do you plan to add some sources in the related work section for baseline numbers on human expert disagreement in fact-checking tasks? (I'm assuming such studies exist.)
Indeed. I prompted each model once, plus one retry on errors. Very good point about measuring intra-model disagreement! Will add in the next version.
Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.
Yes, planning on human-labelling the same corpus of 1,000 claims and publishing a second study measuring the models' performance against the human labels, on a corpus that the models have not seen during training.
Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".
This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...
Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?
Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
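One simple way to summarise such repeats, sketched under the assumption that you already have the n labels a single model gave for a single claim. `self_consistency` is a hypothetical helper and the labels below are invented.

```python
from collections import Counter


def self_consistency(runs):
    """Share of repeated runs that match the model's own most frequent label."""
    counts = Counter(runs)
    return counts.most_common(1)[0][1] / len(runs)


# Invented labels from five hypothetical repeats of one claim.
repeats = ["True", "True", "Mostly True", "True", "False"]
print(self_consistency(repeats))
```

If a model's self-consistency on a claim is low, disagreement with other models on that claim says little about the models and more about sampling noise.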
Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.
The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".
If you are an LLM with a knowledge cutoff in the past and no access to a search tool the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.
> "Neptune Deep will start delivering natural gas in 2027."
This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".
These "Facts" are interesting. "Neptune Deep will start delivering natural gas in 2027." for example is not a fact, it's a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has had a new airport since around December 2025.
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
GPT-5.4 and Opus 4.7, specifically, disagree between themselves on 65% of the claims - 95% CI 62–68%. I.e., in at least 35% of the claims, one of the two models is wrong under this 4-bucket rubric.
Here's the prompt they used:
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
>Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.
Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.
This does not invalidate your point though. Things can be true and misleading.
Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified
Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.
Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they were using it to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.
If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.
> true but misleading
It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"
Allegedly.
As if right wing propaganda shows and manosphere blogs haven't been knocking those out of the park for the last decade+. Although I guess you could say flat out lies are more their jam. Newspapers at least require confirmed sources. You know, journalism.
> I guess the goal is to test the models and not the harness
Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.
> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?
Am I missing something?
[1]: https://en.wikipedia.org/wiki/Almond#Production
Since the agents were instructed to not explain their answer, you can't know if their answer was reasonable or not.
It’s misleading because it’s false. But yes, I think false is quite plainly the better answer there.
Humans can't even properly agree on what "majority" means in all contexts. In some it's "one option has more than half of the total", but in others it'd be "difference in votes between the first-place candidate in an election and the second-place candidate", as just one silly example.
https://en.wikipedia.org/wiki/Majority has a bunch of variations and contexts listed, where it might differ what "Majority" is actually referencing.
This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there's a lot more things they need to fit into their numbers.
This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hands of a human, it would probably get a lot closer to something humanlike.
I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find is often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.
While I agree with what you’re saying, the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?”. It often just spits out a reply based on its knowledge.
That was true a year ago, I don't think it's true today. I can't remember the last time I saw Claude or ChatGPT confidently answer a question that they should have searched for instead.
If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.
Yup, if anything this should be a guide on how not to eval a model. Furthermore, let's say the labels were unambiguous, why would we care about alignment between the models? The only number I would personally care about is percentage of correct answers so I know which models to pick. I reckon with clear and unambiguous prompts we would see huge agreement, if not 100%, on real world facts. The huge models are scary good in their world knowledge.
This paper covers only the disagreement between models and established only the floor of the error, based on the disagreement, but not which model is better. Planning to follow up with another study to benchmark against human-labelled verdicts still using a corpus that the models have not seen during training.
Thanks for the links and digging! It's an interesting question, but the methodology has serious problems, and it would be more interesting to me if they allowed models to provide justification.
I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another.
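To make the structured-output idea concrete, here is a rough sketch. The two field names (`verdict`, `rationale`), the `RESPONSE_SCHEMA` dict and the `parse_response` helper are all made up for illustration; none of this comes from the study's harness, and how the schema gets passed to a given provider's API is left out.

```python
import json
from dataclasses import dataclass

# The four labels from the study's rubric.
ALLOWED_VERDICTS = ("True", "Mostly True", "Misleading", "False")

# Hypothetical JSON Schema for a two-field reply: label in one field, reasoning in the other.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": list(ALLOWED_VERDICTS)},
        "rationale": {"type": "string"},
    },
    "required": ["verdict", "rationale"],
    "additionalProperties": False,
}


@dataclass(frozen=True)
class Verdict:
    verdict: str
    rationale: str


def parse_response(raw: str) -> Verdict:
    """Parse a model's JSON reply and reject anything outside the rubric."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    return Verdict(verdict=data["verdict"], rationale=str(data.get("rationale", "")))
```

The label can still be scored by exact equality, while the rationale is kept around for anyone who wants to check whether a "misleading" was actually defensible.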
Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.
So in other words if the research had tried to assign a severity to the mistakes models made the entire paper may collapse as uninteresting?
For those questions, it wouldn’t surprise me at all if five well-educated intelligent humans disagreed on over two out of three of them.
I would answer “don’t know” on many, but that’s not an option.
Yes, inter-human-annotator disagreement is also high on similar types of questions (AVeriTeC) - inter-panel agreement: κ=0.619. Tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.
It's all fairly lazy to a degree that is mildly confusing. I also feel this among other issues would have become obvious if they had bothered to include a human fact checker baseline (i.e. asked multiple human fact checkers the same questions).
I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.
I don't see how something being lazy for a decade makes it any less lazy.
I think lazy is the right word. You make a misleading point by omitting to collect and present important data. If the headline read "LLMs disagree on 67%, humans disagree on 75%" it would clearly project something very different.
Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.
Thanks. The first link is a spreadsheet. Here's a web-readable version.
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...
I really struggle to believe that this was just a little oopsie. I flagged the article, it seems more misleading than the average Claude hallucination.
False vs misleading doesn't seem like a disagreement?
According to the benchmark it is. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False)"
That claim is both false and misleading.
Yes, they are much closer verdicts. True and Mostly True are also close. Used Krippendorff's α (ordinal) to not penalize much closer disagreements. 21% of the claims have models that are on the polar opposite sides - at least one True, and at least one False.
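For readers who haven't met ordinal α before, below is a small sketch of the textbook coincidence-matrix formulation. It is not the paper's code, and the ordering in `LEVELS` (True < Mostly True < Misleading < False) is an assumption about how the rubric was ranked.

```python
from itertools import permutations

# Assumed ordinal order of the rubric: True < Mostly True < Misleading < False.
LEVELS = ("True", "Mostly True", "Misleading", "False")


def krippendorff_alpha_ordinal(units: list[list[str]]) -> float:
    """Krippendorff's alpha with the ordinal distance metric.

    `units` holds one list of labels per claim (one label per model).
    """
    k = len(LEVELS)
    rank = {label: i for i, label in enumerate(LEVELS)}

    # Coincidence matrix: each ordered pair of labels within a claim
    # contributes 1 / (m - 1), where m is the number of labels on that claim.
    coincidences = [[0.0] * k for _ in range(k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[rank[a]][rank[b]] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidences]  # how often each label occurs
    n = sum(totals)
    if n < 2:
        raise ValueError("need at least one claim with two or more labels")

    def delta_sq(c: int, d: int) -> float:
        # Ordinal metric: labels falling between the two ranks,
        # counting half of each endpoint, squared.
        lo, hi = min(c, d), max(c, d)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    pairs = [(c, d) for c in range(k) for d in range(k)]
    observed = sum(coincidences[c][d] * delta_sq(c, d) for c, d in pairs) / n
    expected = sum(totals[c] * totals[d] * delta_sq(c, d) for c, d in pairs) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected
```

With this metric a "Mostly True" vs "Misleading" split costs far less than a "True" vs "False" split, which is the point of using the ordinal variant instead of plain label equality.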
Here are the claims with at least one True and at least one False:
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
A few examples:
> Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.
> In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.
If we’re going to use LLMs as oracles I don’t think the prompt is unreasonable. They are being sold as geniuses and people are treating them as such, especially given the characterization of AI in science fiction as overly correct. A perfect tool that has "genius level intelligence" would answer correctly.
What's the correct answer for "During a private Saturday call, Democratic members of the United States House of Representatives from Virginia and Hakeem Jeffries discussed strategies after losing a redistricting case at the Supreme Court of Virginia, including trying to flip two or three Republican-held seats under the existing map."?
You can only say True, False, Mostly True or Misleading.
(And you're not allowed to search for information.)
Genius level intelligence will tell you to get lost with your "no explanations" nonsense and tell you why those categories don't make sense and why the question doesn't fit neatly into your boxes.
Give a model a crawler tool (like Grub.nuts.services) and your "problem" goes away.
> All almonds are grown in the U.S. state of California
This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.
ty for digging this up, appreciate the time saving
I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.
If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”
>> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any
Your reply would have more credibility if, instead of commenting on this 25 min after it was posted just to nitpick on some of the questions... you had tried to reproduce the research.
As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?
Why would I do that?
My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.
I don't think I need to spend more time on this than I have.
>> Why would I do that?
I agree you don't owe anyone a reproduction, but you also don't owe anyone an effort to discredit the study, and you did it.
>> I don't think I need to spend more time on this than I have.
How pious of you. I am still looking into the credibility of the study. It will take me more than 25 min...but I am really looking forward to see what this means for this 10 trillion industry.
I can however notice you had enough urgency to publicly critique the study within 25 minutes, and your comments carry weight, but when asked about checking whether the headline result actually holds, the answer is “why would I?”
I've seen enough of this study to be confident in warning people not to take it at face value.
The headline result definitely does not hold, given that the task involves many questions that cannot be answered but there's no option for "cannot be answered" - so models are forced to reply effectively at random.
I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.
Thank you, my eyes glazed over when I saw the article was written with AI.
"Extraterrestrial life exists somewhere in the universe."
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.
Cool.
I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even a "11. Ethics & data use" section, and the research is about LLMs being fallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
Data collection and processing was done manually. LLMs helped with the report drafting. Everything was human reviewed before publishing.
So it's not a secret, why don't you add this upfront to the report? The report itself is even about LLMs, so it makes a lot of sense to disclose your usage of them for writing the report, especially when you're presenting evidence that boils down to LLMs being fallible.
It's an omission on my side. Will add in the next version.
I think you might be able to edit the website to add this, even if you aren't willing to make the report a bit more honest up front.
I'm sure you realize that this website/article will now be sent around to a lot of people, many of whom don't realize exactly how this was written, because they don't read HN comments, they only skim the page contents, and I think most would (incorrectly) assume a report about fallible LLMs to not be written by LLMs, especially when the authors are the same ones who made the report itself.
Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be clamored as great "progress" that will "change everything".
PS: yes, I might or might not have a degree in corporate strategy & PR.
They get more human by the day.
This made me chuckle.
This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
> So many _humans_ can't agree on what the facts are these days.
Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity was struggling with exactly the same issue. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology
One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT 5.4 believes it's false, Sonar thinks it's mostly true. Disagreement value of 3, you can't disagree more than some models thinking it's true, some thinking it's false
But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
https://en.wikipedia.org/wiki/Ruskin_Bond
What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance
LLMs can and will provide inaccurate answers to questions where data is included in their training sets too, that's in the nature of neural networks. It's just less likely than when the data is not in the training set...
Well then it shows that these models are using widely disparate training sets and have high confidence even when they shouldn't.
Questions like "is mouthwash effective" presumably have one solid data source -- medical journals.
But the prompt didn't give the models the option to say "I don't know", so it wasn't a measure of their confidence.
Unfortunately most people are not aware of this and treat LLM models as this superpowered brain who knows everything and can do everything.
This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.
Let's be real, we all asked Claude to summarise this because it was written by Claude
Simple: If it claims to be a fact check it's just propaganda.
That's better than all agreeing on the wrong answer, however.
Btw, sometimes they do that too -- all agree on the wrong answer.
"None of these claims is older than February 15, 2026"
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
Two of the models used have retrieval capabilities and can access newer information via search. Valid point for the other 3 models. All of the claims were submitted after February 15, 2026, but many of them were not time-sensitive (e.g. did not cover events that happened recently).
Personally I find that every llm I use is unable to consistently identify the latest npm version numbers of the node packages that I use.
between the bad methodology, bad selection of 'facts' (some are predictions, some are opinionated, etc.), and ai-written report without disclosure... i dont get why this is so high up on the front page. this is, frankly, a worthless assessment.
i classify the entire thing as "misleading"
Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%
Our experience with Lenz is that forcing a multi-step process, incl. adversarial debates, helps improve the verdicts.
And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect
Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.
Could be an interesting angle for cross-referencing with US jury verdicts, not that the objective True/False issue is concrete, but in the reality that flawed reasoning is endemic to our species. Systems designed and built by humans inherently have flaws in their DNA which take generations to sort out, if ever.
Take my job please.
Only had a brief look at the “facts” that were made to check, many are quite political, where two fact checking organisation of opposite political persuasion would probably disagree more often than 67%.
Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?
The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
The problem is that it's testing claims (or some people would prefer calling them "truths") without much context.
Take just one random example: `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most.
Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority — or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.
Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
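For anyone wanting to sanity-check the quoted interval, here is a minimal sketch of a Wilson score interval. It assumes the 67% headline corresponds to 670 of 1,000 claims (the exact count is my assumption) and, as point (a) above notes, it treats the claims as independent, which is probably optimistic.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumes iid trials)."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 670 disagreeing claims out of 1,000.
low, high = wilson_interval(670, 1000)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.640 to 0.698
```

If claims cluster by topic or news event, the effective sample size is smaller than 1,000 and the real interval would be wider than this.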
Permanent archive: https://doi.org/10.5281/zenodo.20344847
I don't think that current LLMs really need an abstain option, they'll give an answer regardless of whether they're confident or not. I hope that future LLMs will, and will know when to use it.
I understand why you prompted them to output exactly one label, but I'd bet if you'd asked a parametric or parametric "thinking" model to answer eg "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"
[1]: https://lenz.io/c/130f1005
I don't see it mentioned explicitly in the methods section but I assume you prompted each model only once for each question? Did you consider prompting n-times in blank states to see if the models even agree with themselves?
Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".
Do you plan to add some sources in the related work section of baseline numbers for human expert disagreement in fact checking tasks (I'm assuming such studies exist).
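A quick sketch of the "virtual majority model" idea, under my own assumptions: "majority" means a label held by more than half of the panel, and claims with no majority are skipped when scoring individual models. The function names are hypothetical; the sample row is the extraterrestrial-life claim quoted elsewhere in this thread, with labels normalised to the rubric's casing.

```python
from collections import Counter
from typing import Optional


def panel_majority(labels: list[str]) -> Optional[str]:
    """Return the label held by more than half of the panel, or None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def has_disagreement(labels: list[str]) -> bool:
    """True if no majority formed or at least one model dissents from it.

    Note this is the same thing as "the panel is not unanimous".
    """
    majority = panel_majority(labels)
    return majority is None or any(label != majority for label in labels)


def agreement_with_majority(rows: list[dict[str, str]]) -> dict[str, float]:
    """Share of claims (among those with a majority) where each model matches it."""
    hits: Counter = Counter()
    totals: Counter = Counter()
    for row in rows:
        majority = panel_majority(list(row.values()))
        if majority is None:
            continue  # assumption: skip claims where no majority formed
        for model, label in row.items():
            totals[model] += 1
            hits[model] += int(label == majority)
    return {model: hits[model] / totals[model] for model in totals}


rows = [{
    "GPT-5.4": "Misleading",
    "Opus 4.7": "Misleading",
    "Gemini 3": "False",
    "Gemini 3 (Retrieval)": "False",
    "Sonar Pro": "False",
}]
print(agreement_with_majority(rows))
```

On that single row the virtual model answers "False", so the three models that said False score 1.0 and the two that said Misleading score 0.0.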
Indeed. I prompted each model once, plus one retry on errors. Very good point about measuring the intra-model disagreement! Will add in the next version.
Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.
Yes, planning on human-labelling the same corpus of 1,000 claims and publishing a second study measuring the models' performance against the human labels on a corpus that the models have not seen during training.
Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".
This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...
Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?
Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
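Something like the following would do for the repeat-n-times check. It is only a sketch: `ask` stands in for whatever function sends one claim to one model in a fresh session and returns its label, and both that callable and the `self_consistency` name are hypothetical.

```python
from collections import Counter
from typing import Callable


def self_consistency(ask: Callable[[str], str], claim: str, n: int = 5) -> tuple[str, float]:
    """Ask the same claim n times; return (modal label, share of runs that gave it)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    runs = [ask(claim) for _ in range(n)]
    label, count = Counter(runs).most_common(1)[0]
    return label, count / n
```

A share well below 1.0 for a single model would mean some of the reported cross-model disagreement is just sampling noise.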
Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.
Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.
Yes, so in that case you set them up to disagree and then measured disagreement.
The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".
Nice work. Sonar who?
It's one of Perplexity's search-tools-using models.
https://docs.perplexity.ai/docs/agent-api/models
sonar-pro for the retrieval capabilities
(Brought to you by) Lenz...? a crummy commercial...?
...son of a bitch
:) No Lenz data is included in the research on purpose. All information to replicate the results, including the claims data, is published.
looking at the claims i would say 5 humans would disagree even more than the llms
some of the claims where llms disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."
"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."
If you are an LLM with a knowledge cutoff in the past and no access to a search tool the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.
> "Neptune Deep will start delivering natural gas in 2027."
This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".
These "Facts" are interesting. "Neptune Deep will start delivering natural gas in 2027." for example is not a fact, its a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.
Indeed. Real-world claims are somewhat messy. Some of the standard benchmarks, e.g. the questions in AVeriTeC, share similar characteristics.
Would love to learn more
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has recently had a new airport since like December 2025
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
I think ppl only care about how Claude or codex does.
GPT-5.4 and Opus 4.7, specifically, disagree between themselves on 65% of the claims - 95% CI 62–68%. I.e., in at least 35% of the claims, one of the two models is wrong under this 4-bucket rubric.
Looks like they land at the average number of 67% disagreement.
I agree but the market is pricing way beyond that