🤖 AI Summary
This study investigates the perceptual understanding and abstract relational reasoning capabilities of multimodal large language models (MLLMs) in cross-image visual analogical reasoning. Method: We introduce VOILA, a large-scale, open-ended, generative multi-image analogical reasoning benchmark that requires models to produce a novel image completing an analogy, moving beyond conventional closed-set multiple-choice evaluation. The benchmark uses a dynamic, generation-based evaluation framework built on analogical mapping in the visual domain, combined with a multi-step least-to-most prompting strategy and a step-by-step analysis that separates perception from relational reasoning; open-source models (including LLaMA 3.2) and GPT-4o are systematically assessed. Contribution/Results: Experiments reveal severe limitations: the best-performing models reach only 29% accuracy on simple analogies (GPT-4o) and 13% on challenging ones (LLaMA 3.2), substantially below human performance (70%), exposing fundamental deficits in higher-order abstract reasoning. Performance improves when models follow the multi-step least-to-most prompting strategy, and the open-ended generative setting proves an effective probe of the true analogical reasoning capacity of MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when models follow a multi-step least-to-most prompting strategy. Comprehensive evaluations on open-source models and GPT-4o show that, for text-based answers, the best accuracy on challenging scenarios is 13% (LLaMA 3.2) and even on simpler tasks only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
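To make the task structure concrete, below is a minimal sketch of the analogy setup (reference pair A → A', application image B, missing answer B') and the multi-step least-to-most prompting strategy mentioned above. All names (`AnalogyItem`, `query_mllm`, the prompt wording) are hypothetical illustrations under assumed interfaces, not the paper's actual code or prompts.

```python
# Illustrative sketch only: the analogy layout and a least-to-most prompting loop.
# `query_mllm` is an assumed callable: (list of image paths, text prompt) -> text.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AnalogyItem:
    reference_a: str        # image A
    reference_a_prime: str  # image A', i.e., A after the hidden transformation
    application_b: str      # image B; the model must produce/describe B'

def least_to_most(item: AnalogyItem,
                  query_mllm: Callable[[List[str], str], str]) -> str:
    """Decompose the analogy into simpler sub-questions before the final answer."""
    # Step 1: perception — describe each image in isolation.
    desc_a = query_mllm([item.reference_a], "Describe this image.")
    desc_a_prime = query_mllm([item.reference_a_prime], "Describe this image.")
    desc_b = query_mllm([item.application_b], "Describe this image.")

    # Step 2: relational reasoning — infer the transformation in the reference pair.
    relation = query_mllm(
        [item.reference_a, item.reference_a_prime],
        f"Image 1: {desc_a}\nImage 2: {desc_a_prime}\n"
        "What transformation maps image 1 to image 2?",
    )

    # Step 3: apply the inferred relation to B and describe the missing image B'
    # (an image generator could render this description downstream).
    return query_mllm(
        [item.application_b],
        f"This image shows: {desc_b}\nApply this transformation to it: {relation}\n"
        "Describe the resulting image.",
    )
```

In this decomposition, the first step isolates perceptual understanding, while the later steps isolate inter-image relational reasoning, mirroring the multi-step analysis the abstract attributes the models' failures to.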