🤖 AI Summary
This work addresses a critical gap in existing evaluations, which overlook the capacity of multimodal large language models (MLLMs) to compose rules from multiple sources for higher-order analogical reasoning. To this end, the authors introduce CARV, a diagnostic visual benchmark of 5,500 samples that extends analogical reasoning from a single object pair to multiple object pairs, requiring models to extract symbolic rules from each pair and compose them to generate novel transformations. CARV is the first benchmark designed specifically for compositional analogical reasoning. Experiments show that even state-of-the-art models such as Gemini-2.5 Pro achieve only 40.4% accuracy, far below the 100% human performance, and exhibit two systematic failure modes: unreliable rule extraction and a lack of robustness in complex scenes.
📝 Abstract
Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation between one pair of objects onto another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the capacity to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset that serves as the first diagnostic benchmark for this setting. We extend the analogy from a single pair to multiple pairs, requiring MLLMs to extract symbolic rules from each pair and compose them into new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis identifies two consistent failure modes: (1) failing to decompose visual changes into symbolic rules, and (2) losing robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.