Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the effectiveness and applicability boundaries of test-time scaling (TTS) in vision-language models (VLMs), particularly addressing performance disparities between open- and closed-source VLMs on multi-step reasoning versus perception-dominant tasks. Method: We propose a TTS framework integrating structured reasoning, self-reflection, and external verification, and conduct cross-model, cross-benchmark empirical analysis. Contribution/Results: To our knowledge, we are the first to observe that open-source VLMs often suffer performance degradation during iterative self-optimization, whereas external verification proves more robust; in contrast, closed-source models benefit more from structured reasoning. Based on these findings, we introduce a “task–model co-adaptation” TTS paradigm to enable adaptive TTS design and multimodal reward modeling. Experiments demonstrate significant performance gains on multi-step reasoning benchmarks, while improvements on perception-oriented benchmarks remain limited.

📝 Abstract
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating test-time scaling's effectiveness across vision-language models.
Comparing open-source and closed-source VLMs' reasoning gains from TTS.
Identifying dataset-dependent TTS impacts on reasoning versus perception tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling applied to vision-language models
External verification improves open-source VLM performance
Adaptive TTS strategies tailored to model and task
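The external-verification strategy highlighted above can be sketched as a simple best-of-N loop: spend extra inference compute sampling several candidate answers, then let a separate verifier pick the best one. This is a minimal illustration under assumptions, not the paper's exact pipeline; `generate` and `verify` are hypothetical stand-ins for a VLM sampler and an external verifier/reward model.

```python
import random

def generate(question, seed):
    # Hypothetical stand-in for sampling one reasoning chain from a VLM;
    # real usage would decode from the model with temperature > 0.
    rng = random.Random(seed)
    return f"answer-{rng.randint(0, 9)}"

def verify(question, candidate):
    # Hypothetical stand-in for an external verifier / reward model
    # returning a scalar quality score for the candidate.
    return float(candidate.rsplit("-", 1)[1])

def best_of_n(question, n=8):
    # Test-time scaling: sample n candidates and keep the one the
    # external verifier scores highest, instead of trusting a single pass.
    candidates = [generate(question, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: verify(question, c))

print(best_of_n("What is shown in the image?"))
```

Unlike iterative self-refinement, this scheme never asks the model to critique itself, which matches the paper's finding that external verification is the more reliable strategy for open-source VLMs.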
Mohammadjavad Ahmadpour
Department of Computer Engineering, Sharif University of Technology
Amirmahdi Meighani
Department of Computer Engineering, Sharif University of Technology
Payam Taebi
Department of Computer Engineering, Sharif University of Technology
Omid Ghahroodi
Research Assistant at Qatar Computing Research Institute, Sharif University of Technology Alumni
Machine Learning · Deep Learning · Natural Language Processing · LLM · VLM
Amirmohammad Izadi
Department of Computer Engineering, Sharif University of Technology
Mahdieh Soleymani Baghshah
Associate Professor, Computer Engineering Department, Sharif University of Technology
Deep Learning · Machine Learning