🤖 AI Summary
This study evaluates the generalization and robustness of large language models (LLMs) and large reasoning models (LRMs) on analogical and mathematical reasoning under uncertainty. Method: The authors introduce I-RAVEN-X, a symbolic benchmark explicitly designed for uncertain environments, featuring systematically increased operand complexity, expanded attribute ranges, and explicit modeling of perceptual uncertainty, enabling controlled assessment of multi-hypothesis probabilistic reasoning. The methodology combines symbolic, logic-based generation, procedural task construction, and a variable-isolation experimental design, complemented by structural and generative evaluation metrics. Contribution/Results: Experiments show that LRMs significantly outperform LLMs on long-chain and high-dimensional attribute reasoning tasks; however, both model families struggle to enumerate and weigh multiple plausible solutions under uncertainty, revealing fundamental limitations in their probabilistic reasoning mechanisms.
📝 Abstract
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range, and by introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs remain significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
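To make the "reasoning under uncertainty" setting concrete, the sketch below shows one generic way perceptual uncertainty can be modeled in a RAVEN-style task: an attribute (e.g., size) is observed not as a single value but as a confidence distribution, and an arithmetic rule over two panels then induces a distribution over the answer that a solver must marginalize rather than read off. This is a minimal illustration under our own assumptions (the `smooth` confidence model and the addition rule are hypothetical stand-ins), not the actual I-RAVEN-X construction.

```python
import itertools

def smooth(value, n_values, conf=0.8):
    """Perceptual-uncertainty model (illustrative): the observed value
    gets probability `conf`; the remainder is spread uniformly."""
    rest = (1.0 - conf) / (n_values - 1)
    return [conf if v == value else rest for v in range(n_values)]

def add_rule_posterior(p1, p2, n_values):
    """Distribution over the third panel's attribute under an
    arithmetic rule v3 = v1 + v2, marginalizing both uncertain inputs."""
    post = [0.0] * n_values
    for v1, v2 in itertools.product(range(n_values), repeat=2):
        v3 = v1 + v2
        if v3 < n_values:  # keep only outcomes inside the attribute range
            post[v3] += p1[v1] * p2[v2]
    z = sum(post)  # renormalize over valid outcomes
    return [p / z for p in post]

n = 10
p1, p2 = smooth(2, n), smooth(3, n)   # noisy observations of values 2 and 3
posterior = add_rule_posterior(p1, p2, n)
best = max(range(n), key=posterior.__getitem__)  # MAP answer: 5
```

A model that can only commit to single point estimates effectively collapses `posterior` to its mode; the study's finding is that both LLMs and LRMs struggle to track and weigh the full set of plausible outcomes that this marginalization represents.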