AI Summary
This study investigates how large language models (LLMs) generate and update hypotheses under incomplete information and evaluates how closely their reasoning approximates Bayesian optimality. Using the "number game" task, the authors compare LLMs, an optimal Bayesian model, and human behavior through three probes: posterior prediction, hypothesis evaluation, and hypothesis generation. LLM behavior is often well described by a two-parameter Bayesian fit, but with systematic offsets. By default, LLMs show a strong-sampling assumption that acts as an implicit Occam's razor favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance. LLMs also select more correct hypotheses during evaluation than they produce during generation, where their hypotheses are simpler and more rule-like, and they generalize poorly to parts of the hypothesis domain not covered by the observed examples. These results point to a limitation of LLMs as general problem solvers, especially in scientific inference.
Abstract
Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, such as $\{16, 8, 2, 64\}$: a rule like powers of 2 or an interval like numbers near 20. We measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. We then compare LLM behavior with an optimal Bayesian model and human behavior, and test whether the same posterior is expressed across probes. LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance. We also find a robust evaluation--generation gap: LLMs select more correct hypotheses during hypothesis evaluation but generate simpler, more rule-like hypotheses. Finally, this Bayesian-with-bias pattern does not extrapolate. Models can behave as if they hold rule-like hypotheses over observed examples, yet generalize poorly to parts of the hypothesis domain not covered by those examples. Our results highlight a limitation of LLMs as general problem solvers, especially for scientific inference, where hypotheses must go beyond the data.
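To make the strong-sampling idea concrete, the following is a minimal sketch of a textbook-style Bayesian learner for the number game, not the model fitted in the paper. Under strong sampling, each example is assumed to be drawn uniformly from the true hypothesis, so a hypothesis $h$ consistent with $n$ examples receives likelihood $(1/|h|)^n$, which favors narrower hypotheses. The domain (integers 1 to 100), the three-hypothesis space, the uniform prior, and the function names `posterior` and `predict` are all illustrative assumptions.

```python
from fractions import Fraction

# Assumption: the number domain is the integers 1..100.
DOMAIN = range(1, 101)

# Hypothetical toy hypothesis space; a realistic one would be far larger.
HYPOTHESES = {
    "powers of 2": {n for n in DOMAIN if n & (n - 1) == 0},
    "even numbers": {n for n in DOMAIN if n % 2 == 0},
    "all numbers": set(DOMAIN),
}


def posterior(data, hypotheses=HYPOTHESES, prior=None):
    """Posterior over hypotheses under strong sampling.

    Each consistent hypothesis h gets likelihood (1/|h|)^n for n examples;
    hypotheses that miss any example get likelihood 0.
    """
    if prior is None:
        # Assumption: uniform prior over the toy hypothesis space.
        prior = {name: Fraction(1, len(hypotheses)) for name in hypotheses}
    scores = {}
    for name, extension in hypotheses.items():
        if all(x in extension for x in data):
            scores[name] = prior[name] * Fraction(1, len(extension)) ** len(data)
        else:
            scores[name] = Fraction(0)
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the data")
    return {name: score / total for name, score in scores.items()}


def predict(y, post, hypotheses=HYPOTHESES):
    """Posterior predictive: probability that y belongs to the concept."""
    return sum(p for name, p in post.items() if y in hypotheses[name])


post = posterior([16, 8, 2, 64])
for name, p in post.items():
    print(f"{name}: {float(p):.4f}")
print("P(32 in concept):", float(predict(32, post)))
print("P(10 in concept):", float(predict(10, post)))
```

In this toy setting, four examples that are all powers of 2 concentrate nearly all posterior mass on the narrowest consistent hypothesis, even though "even numbers" and "all numbers" also fit the data. The `predict` function corresponds loosely to the posterior prediction probe, while inspecting `post` directly is closer to hypothesis evaluation.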