🤖 AI Summary
This work evaluates the non-linguistic analogical reasoning capabilities of Large Reasoning Models (LRMs) under perceptual uncertainty, using the Raven’s Progressive Matrices (RPM) test as a benchmark. To this end, we introduce I-RAVEN-X—a novel dataset variant that simulates realistic visual noise by injecting confounding attributes and smoothing attribute distributions. We systematically reveal, for the first time, severe performance degradation of LRMs under such uncertainty: o3-mini accuracy drops from 86.6% to 17.0%, and DeepSeek R1 falls from 80.6% to 23.2%. This motivates a new robustness evaluation paradigm tailored to non-linguistic reasoning. Furthermore, we demonstrate the strong generalization of the neuro-symbolic probabilistic abductive model ARLC, which maintains 88.0% accuracy—a mere 10.6-point decline—substantially outperforming all evaluated LRMs. Our results underscore the critical value of symbolic priors in modeling perceptual uncertainty.
📝 Abstract
This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's Progressive Matrices. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and wider ranges of attribute values. To assess the influence of visual uncertainty on these nonverbal analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the puzzles' correct answers, and 2) we smooth the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and attribute ranges and emulates perceptual uncertainty. This drop occurs despite the model spending 3.4x more reasoning tokens. A similar trend is observed for DeepSeek R1: from 80.6% to 23.2%. In contrast, ARLC, a neuro-symbolic probabilistic abductive model that achieves state-of-the-art performance on I-RAVEN, reasons robustly under all of these out-of-distribution tests, maintaining strong accuracy with only a modest reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
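The two-fold perturbation strategy described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the function names (`add_confounders`, `smooth_attribute`), the panel representation as a flat attribute dictionary, and the uniform epsilon-smoothing scheme are all assumptions made for clarity; the released code at the repository above defines the real implementation.

```python
import random

def add_confounders(panel, num_confounders=2, value_range=10, rng=None):
    """Step 1 (sketch): append attributes sampled i.i.d. at random,
    so they carry no signal about the puzzle's correct answer."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noisy = dict(panel)
    for i in range(num_confounders):
        noisy[f"confounder_{i}"] = rng.randrange(value_range)
    return noisy

def smooth_attribute(value, value_range=10, epsilon=0.2):
    """Step 2 (sketch): replace a crisp attribute value with a
    distribution: (1 - epsilon) mass on the observed value, the
    remaining epsilon spread uniformly over all possible values."""
    dist = [epsilon / value_range] * value_range
    dist[value] += 1.0 - epsilon
    return dist

# Example: a toy panel with one genuine attribute.
panel = {"color": 3}
noisy_panel = add_confounders(panel)      # adds 2 uninformative attributes
color_dist = smooth_attribute(panel["color"])  # soft distribution over 10 values
```

Under this sketch, a model must both ignore the confounders and reason over soft rather than crisp attribute values, which is the combination the abstract reports LRMs struggle with.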