🤖 AI Summary
This study systematically investigates the effectiveness and applicability boundaries of test-time scaling (TTS) in vision-language models (VLMs), particularly addressing performance disparities between open- and closed-source VLMs on multi-step reasoning versus perception-dominant tasks.
Method: We propose a TTS framework integrating structured reasoning, self-reflection, and external verification, and conduct a cross-model, cross-benchmark empirical analysis.
Contribution/Results: We are the first to observe that open-source VLMs often suffer performance degradation during iterative self-refinement, whereas external verification proves more robust; closed-source models, in contrast, benefit more from structured reasoning. Based on these findings, we introduce a “task–model co-adaptation” TTS paradigm to guide adaptive TTS design and multimodal reward modeling. Experiments demonstrate significant performance gains on multi-step reasoning benchmarks, while improvements on perception-oriented benchmarks remain limited.
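The decision logic implied by these findings can be sketched as a simple strategy selector. This is a minimal illustration of the "task–model co-adaptation" idea only; the function and strategy names are hypothetical, not the paper's actual implementation:

```python
# Hypothetical sketch: route a (model family, task type) pair to a
# test-time scaling (TTS) strategy, following the study's findings.
# All identifiers below are illustrative assumptions.

def select_tts_strategy(model_family: str, task_type: str) -> str:
    """Pick a TTS strategy for a given model family and task type.

    model_family: "open" or "closed" (source availability of the VLM)
    task_type: "reasoning" (multi-step) or "perception"
    """
    if task_type == "perception":
        # TTS offers only limited gains on perception-focused benchmarks,
        # so skip the extra inference-time compute.
        return "direct_answer"
    if model_family == "closed":
        # Closed-source models benefit from structured reasoning and
        # iterative self-refinement.
        return "structured_reasoning_with_refinement"
    # Open-source models: iterative refinement often degrades performance,
    # so external verification is the more reliable choice.
    return "external_verification"
```

In practice, such a selector would be one component of an adaptive TTS system; the study motivates learning this routing (e.g., via multimodal reward models) rather than hard-coding it.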
📝 Abstract
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference time, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs across a range of benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative self-refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.