🤖 AI Summary
This work investigates the active reasoning capability of multimodal large language models (MLLMs) under information-deficient conditions—specifically, their ability to autonomously acquire missing evidence and iteratively refine decisions. To this end, we introduce GuessBench, the first benchmark explicitly designed to evaluate active reasoning in MLLMs; it distinguishes perception-oriented from knowledge-oriented images and systematically assesses models' capacity to identify a target image from a candidate pool without task-specific priors. Through cross-model ablation studies, we uncover distinct mechanisms by which perceptual enhancement and chain-of-thought reasoning affect models of varying scales. Evaluation across 20 state-of-the-art MLLMs reveals that active reasoning performance lags significantly behind passive reasoning, with core bottlenecks lying in insufficient fine-grained visual perception and suboptimal timing of evidence selection. This work establishes a novel evaluation paradigm and provides a critical diagnostic tool for advancing MLLMs toward embodied intelligence.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 state-of-the-art MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.