🤖 AI Summary
This study systematically investigates the effectiveness and applicability boundaries of test-time scaling (TTS) in vision-language models (VLMs), particularly addressing performance disparities between open- and closed-source VLMs on multi-step reasoning versus perception-dominant tasks.
Method: We propose a TTS framework integrating structured reasoning, self-reflection, and external verification, and conduct a cross-model, cross-benchmark empirical analysis.
Contribution/Results: We are the first to observe that open-source VLMs often suffer performance degradation during iterative self-refinement, whereas external verification proves more robust; closed-source models, in contrast, benefit more from structured reasoning. Based on these findings, we introduce a “task–model co-adaptation” TTS paradigm to guide adaptive TTS design and multimodal reward modeling. Experiments demonstrate significant performance gains on multi-step reasoning benchmarks, while improvements on perception-oriented benchmarks remain limited.
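The decision logic implied by these findings can be sketched as a simple strategy selector. This is a minimal illustration of the "task–model co-adaptation" idea only; the function and strategy names are hypothetical, not the paper's actual implementation:

```python
# Hypothetical sketch: route a (model family, task type) pair to a
# test-time scaling (TTS) strategy, following the study's findings.
# All identifiers below are illustrative assumptions.

def select_tts_strategy(model_family: str, task_type: str) -> str:
    """Pick a TTS strategy for a given model family and task type.

    model_family: "open" or "closed" (source availability of the VLM)
    task_type: "reasoning" (multi-step) or "perception"
    """
    if task_type == "perception":
        # TTS offers only limited gains on perception-focused benchmarks,
        # so skip the extra inference-time compute.
        return "direct_answer"
    if model_family == "closed":
        # Closed-source models benefit from structured reasoning and
        # iterative self-refinement.
        return "structured_reasoning_with_refinement"
    # Open-source models: iterative refinement often degrades performance,
    # so external verification is the more reliable choice.
    return "external_verification"
```

In practice, such a selector would be one component of an adaptive TTS system; the study motivates learning this routing (e.g., via multimodal reward models) rather than hard-coding it.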
📝 Abstract
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference time, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs across a range of benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative self-refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.