Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can reliably distinguish logically impossible events (e.g., “the brake issued a traffic ticket to the car”) from merely improbable yet physically possible ones. Method: We systematically decouple event possibility, typicality, and contextual plausibility, constructing a controlled synthetic dataset of minimal-pair sentences. Using zero-shot probability estimation and rigorous statistical significance testing, we conduct cross-architectural evaluation on Llama 3, Gemma 2, Mistral NeMo, and others. Contribution/Results: All models exhibit significantly sub-chance accuracy (as low as 32%) under critical conditions, consistently preferring logically impossible over merely improbable events—a counterintuitive “impossibility preference.” This reveals a fundamental failure in LLMs’ probabilistic calibration and challenges the assumption that they implicitly encode sound world models. Our study establishes a novel benchmark and methodology for evaluating LLMs’ causal and physical commonsense reasoning capabilities.
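The evaluation described above boils down to comparing a model's score for two minimal-pair sentences that differ in a single word. The sketch below illustrates that comparison logic with a toy unigram scorer standing in for a real LLM (the paper uses token-level conditional probabilities from Llama 3, Gemma 2, and Mistral NeMo); the word list, probabilities, and function names here are illustrative assumptions, not the paper's actual setup.

```python
import math

# Toy unigram "language model": word -> probability. A hand-picked stand-in
# for a real LLM; only the comparison logic mirrors the zero-shot setup.
# "brake" is given a higher probability than "explorer" to mimic its higher
# frequency and contextual relatedness to "car".
TOY_LM = {
    "the": 0.20, "car": 0.05, "was": 0.10, "given": 0.03,
    "a": 0.15, "parking": 0.01, "ticket": 0.02, "by": 0.08,
    "brake": 0.004, "explorer": 0.001,
}

def sentence_logprob(sentence, lm=TOY_LM, oov=1e-6):
    """Sum of per-word log-probabilities. An actual LLM would instead sum
    log P(token_i | token_1..i-1) over its tokenization."""
    return sum(math.log(lm.get(w, oov)) for w in sentence.lower().split())

impossible = "the car was given a parking ticket by the brake"
improbable = "the car was given a parking ticket by the explorer"

# A model with a sound world model should prefer the merely improbable
# sentence; scoring the impossible one higher is the paper's
# "impossibility preference".
prefers_possible = sentence_logprob(improbable) > sentence_logprob(impossible)
print(prefers_possible)  # prints False: the toy scorer reproduces the failure
```

Because the two sentences differ only in the final word, the comparison reduces to `log P("explorer")` vs. `log P("brake")`, so a scorer driven by word frequency and contextual relatedness alone will prefer the impossible continuation.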

📝 Abstract
Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.
Problem

Research questions and friction points this paper is trying to address.

Assess if language models distinguish impossible vs improbable events
Evaluate model robustness in predicting event likelihood accurately
Test performance decline in assigning probabilities to impossible sentences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing possibility, typicality, contextual relatedness
Testing models like Llama 3, Gemma 2, Mistral NeMo
Comparing impossible vs improbable event probabilities
J. Michaelov
Department of Brain and Cognitive Sciences, MIT, MIT Libraries CREOS
Reeka Estacio
Department of Cognitive Science, UCSD
Zhien Zhang
Department of Cognitive Science, UCSD
Benjamin Bergen
Professor of Cognitive Science, UC San Diego
Language comprehension and production · metaphor · grammar · profanity · driving