Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether vision-language models (VLMs) exhibit inverse scaling—i.e., longer but less effective reasoning—under test-time visual distractors. Method: We construct Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions, and analyze how such distractors affect both reasoning length and accuracy. We further propose an attribute-counting–based reasoning-path tracing method to characterize distractor effects, and design lightweight prompting strategies to mitigate bias-driven predictions on benchmarks such as Waterbirds. Contribution/Results: We empirically show that visual distractors differ fundamentally from textual ones: although inverse scaling persists, they significantly reduce accuracy *without* increasing reasoning length. Our work provides an interpretable, intervention-aware analytical framework grounded in reasoning-path diagnostics, along with effective mitigation strategies.
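The attribute-counting idea above can be sketched as a simple trace analysis: count how often each attribute type (semantic, numerical, spatial) is mentioned in a model's reasoning trace. This is a minimal illustrative sketch; the keyword lists and the plain-text trace format are assumptions, not the authors' exact procedure.

```python
import re

# Illustrative (assumed) keyword lists per attribute type; the paper's
# actual attribute vocabulary is not specified here.
ATTRIBUTE_KEYWORDS = {
    "semantic": ["color", "red", "blue", "shape", "cat", "dog"],
    "numerical": ["one", "two", "three", "count", "number"],
    "spatial": ["left", "right", "above", "below", "behind"],
}

def attribute_counts(trace: str) -> dict:
    """Count mentions of each attribute type in a reasoning trace."""
    words = re.findall(r"[a-z]+", trace.lower())
    return {
        attr: sum(words.count(kw) for kw in kws)
        for attr, kws in ATTRIBUTE_KEYWORDS.items()
    }

trace = "The red ball is left of two blue boxes; count the boxes on the right."
print(attribute_counts(trace))
# → {'semantic': 2, 'numerical': 2, 'spatial': 2}
```

Tracking such counts across reasoning steps gives a rough signal of whether distractor-related attributes dominate the trace as reasoning length grows.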

📝 Abstract
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Investigates how visual distractors affect reasoning accuracy in vision-language models
Analyzes inverse scaling effects of irrelevant information during test-time computation
Explores methods to mitigate bias-driven predictions in multimodal reasoning systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced Idis dataset with systematic visual distractors
Analyzed inverse scaling effects in vision-language models
Proposed prompting strategy to mitigate bias-driven predictions
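The mitigation strategy listed above is prompt-based. A minimal sketch of what such a distractor-aware prompt wrapper could look like is below; the exact instruction wording is an assumption, not the authors' prompt.

```python
def debiased_prompt(question: str) -> str:
    """Wrap a VQA question with an instruction to ignore distractors.

    Hypothetical wording in the spirit of the paper's mitigation strategy:
    tell the model to disregard irrelevant objects and spurious background
    cues (e.g., water vs. land scenery on Waterbirds) before answering.
    """
    return (
        "The image may contain irrelevant or misleading objects (distractors). "
        "Focus only on the objects the question asks about, and do not rely "
        "on background cues such as scenery when identifying the subject.\n"
        f"Question: {question}"
    )

print(debiased_prompt("Is the bird a waterbird or a landbird?"))
```

The wrapped string would then be passed to the reasoning VLM in place of the raw question.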
Jiyun Bae
Pohang University of Science and Technology (POSTECH)
Hyunjong Ok
Pohang University of Science and Technology (POSTECH)
Sangwoo Mo
Assistant Professor, POSTECH
Artificial Intelligence · Deep Learning · Multimodal Models
Jaeho Lee
Pohang University of Science and Technology (POSTECH)