🤖 AI Summary
This study investigates the perceptual understanding and abstract relational reasoning capabilities of multimodal large language models (MLLMs) in cross-image visual analogical reasoning. Method: We introduce VOILA, a large-scale, open-ended, generative multi-image analogical reasoning benchmark that requires models to produce a novel image completing an analogy, moving beyond conventional closed-set multiple-choice evaluation. The benchmark uses a dynamic, generation-based evaluation framework built on analogical mapping in the visual domain, combined with a multi-step least-to-most prompting strategy and a step-by-step analysis that separates perception from relational reasoning; open-source models (including LLaMA 3.2) and GPT-4o are systematically assessed. Contribution/Results: Experiments reveal severe limitations: the best-performing models reach only 29% accuracy on simple analogies (GPT-4o) and 13% on challenging ones (LLaMA 3.2), substantially below human performance (70%), exposing fundamental deficits in higher-order abstract reasoning. Performance improves when models follow the multi-step least-to-most prompting strategy, and the open-ended generative setting proves an effective probe of the true analogical reasoning capacity of MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when models follow a multi-step least-to-most prompting strategy. Comprehensive evaluations on open-source models and GPT-4o show that, for text-based answers, the best accuracy on challenging scenarios is 13% (LLaMA 3.2) and even on simpler tasks only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
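To make the task structure concrete, below is a minimal sketch of the analogy setup (reference pair A → A', application image B, missing answer B') and the multi-step least-to-most prompting strategy mentioned above. All names (`AnalogyItem`, `query_mllm`, the prompt wording) are hypothetical illustrations under assumed interfaces, not the paper's actual code or prompts.

```python
# Illustrative sketch only: the analogy layout and a least-to-most prompting loop.
# `query_mllm` is an assumed callable: (list of image paths, text prompt) -> text.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AnalogyItem:
    reference_a: str        # image A
    reference_a_prime: str  # image A', i.e., A after the hidden transformation
    application_b: str      # image B; the model must produce/describe B'

def least_to_most(item: AnalogyItem,
                  query_mllm: Callable[[List[str], str], str]) -> str:
    """Decompose the analogy into simpler sub-questions before the final answer."""
    # Step 1: perception — describe each image in isolation.
    desc_a = query_mllm([item.reference_a], "Describe this image.")
    desc_a_prime = query_mllm([item.reference_a_prime], "Describe this image.")
    desc_b = query_mllm([item.application_b], "Describe this image.")

    # Step 2: relational reasoning — infer the transformation in the reference pair.
    relation = query_mllm(
        [item.reference_a, item.reference_a_prime],
        f"Image 1: {desc_a}\nImage 2: {desc_a_prime}\n"
        "What transformation maps image 1 to image 2?",
    )

    # Step 3: apply the inferred relation to B and describe the missing image B'
    # (an image generator could render this description downstream).
    return query_mllm(
        [item.application_b],
        f"This image shows: {desc_b}\nApply this transformation to it: {relation}\n"
        "Describe the resulting image.",
    )
```

In this decomposition, the first step isolates perceptual understanding, while the later steps isolate inter-image relational reasoning, mirroring the multi-step analysis the abstract attributes the models' failures to.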