I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the generalization and robustness of large language models (LLMs) and large reasoning models (LRMs) on analogical and mathematical reasoning under uncertainty. Method: We introduce I-RAVEN-X, the first symbolic benchmark explicitly designed for uncertain environments, featuring systematically increased operand complexity, expanded attribute spaces, and explicit modeling of perceptual uncertainty—enabling controlled assessment of multi-hypothesis probabilistic reasoning. Our methodology integrates symbolic logic-based generation, procedural task construction, and a variable-isolation experimental design, complemented by structural and generative evaluation metrics. Contribution/Results: Experiments demonstrate that LRMs significantly outperform LLMs on long-chain and high-dimensional attribute reasoning tasks; however, both model families struggle to enumerate and weigh multiple plausible solutions under uncertainty, revealing fundamental limitations in their probabilistic reasoning mechanisms.

📝 Abstract
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range, and by introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating generalization in analogical reasoning for large models
Assessing robustness of mathematical reasoning under uncertainty
Testing model performance on extended attribute ranges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended symbolic benchmark with increased operand complexity
Introduced perceptual uncertainty in reasoning evaluation
Enhanced attribute range for systematicity assessment
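To make the "perceptual uncertainty" idea concrete, the sketch below shows one way such an evaluation target could be posed (this is an illustration only, not the paper's actual generator: the function name `answer_scores`, the categorical-distribution encoding, and the sum rule are all assumptions). Each observed panel attribute is a distribution over possible values rather than a single value, and the ideal reasoner scores candidate answers by marginalizing over all joint interpretations — exactly the multi-hypothesis weighing the paper reports LLMs and LRMs struggle with.

```python
import itertools

def answer_scores(obs1, obs2, candidates, rule=lambda a, b: a + b):
    """Score candidate values for the third panel of an RPM-style row,
    given noisy (probabilistic) perceptions of the first two panels.

    obs1, obs2: dicts mapping possible attribute values -> probability
                (a categorical "perceptual uncertainty" distribution).
    candidates: iterable of candidate attribute values for panel 3.
    rule:       the hidden relation; an arithmetic sum here, standing in
                for the higher-operand relations I-RAVEN-X introduces.
    """
    scores = {c: 0.0 for c in candidates}
    # Marginalize over every joint interpretation of the two observations.
    for (a, pa), (b, pb) in itertools.product(obs1.items(), obs2.items()):
        target = rule(a, b)
        if target in scores:
            scores[target] += pa * pb
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

# Example: panel 1 is probably 3 (maybe 4), panel 2 is probably 5 (maybe 6).
scores = answer_scores({3: 0.8, 4: 0.2}, {5: 0.9, 6: 0.1}, [8, 9, 10, 11])
# → {8: 0.72, 9: 0.26, 10: 0.02, 11: 0.0}; 8 is the best-supported answer.
```

Note that the correct behavior is to spread probability mass across several plausible answers (8, 9, and 10 all get support) rather than commit to one — the capability the benchmark probes.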