Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can reliably distinguish logically impossible events (e.g., “the brake issued a traffic ticket to the car”) from merely improbable yet physically possible ones. Method: We systematically decouple event possibility, typicality, and contextual plausibility, constructing a controlled synthetic dataset of minimal-pair sentences. Using zero-shot probability estimation and rigorous statistical significance testing, we conduct cross-architectural evaluation on Llama 3, Gemma 2, Mistral NeMo, and others. Contribution/Results: All models exhibit significantly sub-chance accuracy (as low as 32%) under critical conditions, consistently preferring logically impossible over merely improbable events—a counterintuitive “impossibility preference.” This reveals a fundamental failure in LLMs’ probabilistic calibration and challenges the assumption that they implicitly encode sound world models. Our study establishes a novel benchmark and methodology for evaluating LLMs’ causal and physical commonsense reasoning capabilities.
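The evaluation described above boils down to comparing a model's score for two minimal-pair sentences that differ in a single word. The sketch below illustrates that comparison logic with a toy unigram scorer standing in for a real LLM (the paper uses token-level conditional probabilities from Llama 3, Gemma 2, and Mistral NeMo); the word list, probabilities, and function names here are illustrative assumptions, not the paper's actual setup.

```python
import math

# Toy unigram "language model": word -> probability. A hand-picked stand-in
# for a real LLM; only the comparison logic mirrors the zero-shot setup.
# "brake" is given a higher probability than "explorer" to mimic its higher
# frequency and contextual relatedness to "car".
TOY_LM = {
    "the": 0.20, "car": 0.05, "was": 0.10, "given": 0.03,
    "a": 0.15, "parking": 0.01, "ticket": 0.02, "by": 0.08,
    "brake": 0.004, "explorer": 0.001,
}

def sentence_logprob(sentence, lm=TOY_LM, oov=1e-6):
    """Sum of per-word log-probabilities. An actual LLM would instead sum
    log P(token_i | token_1..i-1) over its tokenization."""
    return sum(math.log(lm.get(w, oov)) for w in sentence.lower().split())

impossible = "the car was given a parking ticket by the brake"
improbable = "the car was given a parking ticket by the explorer"

# A model with a sound world model should prefer the merely improbable
# sentence; scoring the impossible one higher is the paper's
# "impossibility preference".
prefers_possible = sentence_logprob(improbable) > sentence_logprob(impossible)
print(prefers_possible)  # prints False: the toy scorer reproduces the failure
```

Because the two sentences differ only in the final word, the comparison reduces to `log P("explorer")` vs. `log P("brake")`, so a scorer driven by word frequency and contextual relatedness alone will prefer the impossible continuation.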

📝 Abstract
Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.
Problem

Research questions and friction points this paper is trying to address.

Assess if language models distinguish impossible vs improbable events
Evaluate model robustness in predicting event likelihood accurately
Test performance decline in assigning probabilities to impossible sentences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing possibility, typicality, contextual relatedness
Testing models like Llama 3, Gemma 2, Mistral NeMo
Comparing impossible vs improbable event probabilities
J. Michaelov
Department of Brain and Cognitive Sciences, MIT, MIT Libraries CREOS
Reeka Estacio
Department of Cognitive Science, UCSD
Zhien Zhang
Department of Cognitive Science, UCSD
Benjamin Bergen
Professor of Cognitive Science, UC San Diego
Language comprehension and production · metaphor · grammar · profanity · driving