Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models excel on standard benchmarks but exhibit markedly poor robustness when confronted with visual illusions and anomalous scenes that violate common-sense priors. To address this gap, this work introduces VIA-Bench, a benchmark spanning six categories of visual illusions and anomalies with over 1,000 high-quality, human-verified question-answer pairs; it is the first to systematically incorporate such cognitive challenges into the evaluation of multimodal models. Evaluating more than twenty state-of-the-art multimodal large language models on the benchmark uncovers widespread perceptual fragility. Notably, chain-of-thought reasoning not only fails to improve performance but often produces "brittle mirages" in which the model's reasoning collapses under illusory stimuli. These findings highlight a fundamental divergence between model perception and human cognition, pointing to new directions for improving commonsense robustness in multimodal systems.

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding "brittle mirages" where the model's logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.
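Since the benchmark data and code have not yet been released, the following is only a minimal sketch of the kind of per-category evaluation loop the abstract describes, contrasting direct answering with CoT prompting. The JSONL file name, record schema, category names, and `query_model` interface are all hypothetical placeholders for illustration, not the paper's actual API.

```python
# Hypothetical sketch of scoring an MLLM on category-tagged illusion QA
# pairs, with an optional chain-of-thought prompt. Schema and interface
# are assumptions; the real VIA-Bench format is not yet published.
import json
from collections import defaultdict
from typing import Callable, Dict

# Assumed category labels mirroring the six groups named in the abstract.
CATEGORIES = [
    "color", "motion", "gestalt",
    "geometric_spatial", "general_illusion", "anomaly",
]

def evaluate(qa_path: str,
             query_model: Callable[[str, str], str],
             use_cot: bool = False) -> Dict[str, float]:
    """Return per-category accuracy on multiple-choice illusion QA.

    Each JSONL record is assumed to hold: image (path), question,
    options (list of strings), answer (option letter), category.
    query_model(image_path, prompt) is any MLLM wrapper returning text.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            prompt = item["question"] + "\nOptions: " + ", ".join(item["options"])
            if use_cot:
                prompt += "\nThink step by step, then give the final option letter."
            prediction = query_model(item["image"], prompt)
            total[item["category"]] += 1
            # Crude match: prediction is expected to begin with the letter.
            if prediction.strip().upper().startswith(item["answer"]):
                correct[item["category"]] += 1
    return {c: correct[c] / total[c] for c in total}
```

Comparing `evaluate(path, model)` against `evaluate(path, model, use_cot=True)` category by category would surface the paper's headline finding: whether CoT prompting narrows or, as reported, fails to narrow the gap on illusory stimuli.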
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Visual Illusions
Visual Anomalies
Robustness
Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Visual Illusions
Robustness Benchmark
Chain-of-Thought Reasoning
Perceptual Bottlenecks