Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically investigates the probabilistic reasoning capabilities of large language models (LLMs) over explicit discrete probability distributions, focusing on three core tasks: mode identification, maximum likelihood estimation, and sample generation. It introduces a standardized, prompt-based evaluation framework to empirically assess LLM performance in frequency analysis, marginal inference, and conditional generation, the first comprehensive study of its kind. Key findings: (1) performance improves significantly with model scale but remains highly sensitive to the symbolic representation of outcomes (e.g., numeric vs. verbal format) and to context length; (2) LLMs exhibit unexpectedly strong sample-generation ability, yet suffer over 60% performance degradation under long-context conditions, revealing a critical scalability bottleneck; (3) probabilistic reasoning is shown to be non-robust and representation-dependent, highlighting fundamental limitations. These results establish benchmarks for trustworthy probabilistic AI and point to targeted prompting and architectural improvements.

📝 Abstract
Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks (mode identification, maximum likelihood estimation, and sample generation) by prompting them to answer queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. Through comprehensive empirical evaluations, we demonstrate a clear performance gap between smaller and larger models, with the latter showing stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to the notation used to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
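The three evaluated tasks have simple classical ground-truth procedures on a discrete distribution, which the models are expected to approximate from prompted observations. A minimal sketch of those procedures (the observations and outcome labels below are invented for illustration and are not taken from the paper's benchmark):

```python
import random
from collections import Counter

# Hypothetical observations from a discrete distribution over {"A", "B", "C"};
# the paper's actual evaluation data are not reproduced here.
observations = ["A", "B", "A", "C", "A", "B", "A", "C", "A", "B"]

counts = Counter(observations)
n = len(observations)

# Task 1: mode identification -- the most frequent outcome.
mode = max(counts, key=counts.get)

# Task 2: maximum likelihood estimation -- for a categorical distribution,
# the MLE of each outcome's probability is its empirical frequency.
mle = {outcome: c / n for outcome, c in counts.items()}

# Task 3: sample generation -- draw fresh samples from the estimated distribution.
random.seed(0)
samples = random.choices(list(mle), weights=list(mle.values()), k=5)

print(mode)  # -> A
print(mle)   # -> {'A': 0.5, 'B': 0.3, 'C': 0.2}
```

An LLM answering the same queries from a textual rendering of `observations` can then be scored against these reference values.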
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' probabilistic reasoning on discrete distributions
Assessing performance gaps between small and large language models
Characterizing limitations from notation sensitivity and long-context degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating probabilistic reasoning via prompting
Testing frequency analysis and marginalization skills
Assessing sample generation capabilities across models
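The prompt-based evaluation idea can be sketched as follows: render the observations in a chosen notation, query the model, and score its answer against the empirical ground truth. The function names, prompt wording, and notation options here are illustrative assumptions, not the paper's actual framework:

```python
from collections import Counter

def build_prompt(observations, notation="numeric"):
    """Render observations in one of two notations; the paper reports that
    performance is sensitive to such representation choices."""
    if notation == "numeric":
        rendered = ", ".join(str(o) for o in observations)
    else:  # a verbal rendering of the same outcomes (illustrative)
        words = {1: "one", 2: "two", 3: "three"}
        rendered = ", ".join(words[o] for o in observations)
    return (f"Observations: {rendered}\n"
            "Which outcome occurs most often? Answer with the outcome only.")

def score_mode_identification(observations, model_answer):
    """Exact-match comparison of the model's answer with the empirical mode."""
    counts = Counter(observations)
    true_mode = max(counts, key=counts.get)
    return str(model_answer).strip() == str(true_mode)

# The model call itself is stubbed out; plug in any LLM client.
prompt = build_prompt([1, 2, 1, 3, 1, 2], notation="verbal")
print(prompt)
```

Marginalization and conditional-generation queries would follow the same pattern, with the scoring function comparing against the relevant marginal or conditional frequencies instead of the mode.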