🤖 AI Summary
Current multimodal large language models (MLLMs) show significant limitations when they must jointly perform fine-grained visual perception and commonsense causal reasoning. Method: We introduce Argus Inspection, a two-tier multimodal benchmark that integrates fine-grained visual recognition with real-world commonsense knowledge to evaluate causal reasoning; we propose the Eye of Panoptes evaluation framework, which combines a binary-parameterized sigmoid function with an indicator function to holistically score opinion-based reasoning; and we design a standardized causal reasoning evaluation protocol. Contribution/Results: Empirical evaluation across 26 state-of-the-art MLLMs shows that the highest fine-grained visual reasoning score is only 0.46, revealing a fundamental deficiency in perception–reasoning synergy. This work establishes a dedicated benchmark and evaluation framework for diagnosing coupled visual–causal reasoning in MLLMs, exposing critical gaps in current modeling paradigms.
📝 Abstract
As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have made remarkable progress. However, challenges in fine-grained visual perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty that emphasizes detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Building on this benchmark, we present the Eye of Panoptes framework, which integrates a binary-parameterized sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments on 26 mainstream MLLMs reveal that the highest performance in fine-grained visual reasoning reaches only 0.46, leaving substantial room for improvement. Our research offers valuable perspectives for the continued refinement of MLLMs.
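The abstract does not spell out the scoring formula, but the described combination of a two-parameter sigmoid with an indicator function might be sketched as follows. This is an illustrative interpretation, not the paper's actual implementation: the function name, the parameters `alpha` and `beta`, and the gating semantics are all assumptions for the sake of the example.

```python
import math

def panoptes_style_score(raw_score: float, is_valid: bool,
                         alpha: float = 1.0, beta: float = 0.5) -> float:
    """Hypothetical sketch of an indicator-gated, two-parameter sigmoid score.

    raw_score: a raw reasoning-quality score for a model response.
    is_valid:  whether the response passes some validity check
               (the indicator function's condition).
    alpha, beta: the two sigmoid parameters (slope and midpoint);
               their actual roles in the paper may differ.
    """
    # Indicator function: invalid responses are gated to zero.
    indicator = 1.0 if is_valid else 0.0
    # Two-parameter ("binary-parameterized") sigmoid of the raw score.
    sigmoid = 1.0 / (1.0 + math.exp(-alpha * (raw_score - beta)))
    return indicator * sigmoid
```

Under this reading, the sigmoid smooths raw scores into (0, 1) while the indicator enforces a hard validity constraint, which would let a single scalar reflect both graded reasoning quality and a binary acceptability judgment.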