🤖 AI Summary
Current multimodal large language model (MLLM) evaluations over-rely on end-to-end accuracy, neglecting robustness of visual perception foundations, attribution fidelity, and reasoning under perturbations—thus failing to disentangle performance gains from genuine visual understanding versus textual priors.
Method: We propose the first systematic perceptual observability framework, establishing a vertical evaluation taxonomy that decouples linguistic and visual capabilities. Using controlled stress tests—including pixel-level perturbations, diffusion-based hallucination generation, grid pointer games, and attribute localization—we quantitatively assess visual grounding, textual comprehension, and spatial localization on face- and text-centric ground-truth datasets.
Contribution/Results: Experiments expose severe visual grounding fragility across mainstream MLLMs, demonstrating that their apparent performance improvements stem predominantly from internet-scale textual knowledge rather than faithful interpretation of visual signals. Our framework provides a novel diagnostic benchmark for trustworthy multimodal AI.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals like: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.