Efficient Test-Time Scaling for Small Vision-Language Models

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small vision-language models (VLMs) offer computational efficiency but suffer from limited generalization; existing test-time scaling methods often incur substantial computational overhead, contradicting their lightweight design. This paper proposes two unsupervised test-time scaling strategies that add no parameters: (1) Test-Time Augmentation (TTAug), which aggregates outputs at the token level using internal model features, and (2) Test-Time Adaptation (TTAdapt), which calibrates model parameters during inference using consistency-driven pseudo-labels. Both methods require no external data or annotations, relying only on the structural and semantic consistency within the test samples themselves. Evaluated across nine diverse benchmarks, the approaches significantly improve the accuracy of small VLMs while preserving low inference latency and strong generalization. They are also agnostic to architecture and scale, applying to various VLM backbones and sizes without modification.
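The TTAug idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model_logits` and `augment` are hypothetical stand-ins for a VLM forward pass and its augmentation set, and token-level aggregation is shown here as simple logit averaging across augmented views.

```python
import numpy as np

def model_logits(image: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a small VLM's per-token output logits,
    # shape (seq_len, vocab_size). A real model would run a forward pass.
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    return rng.normal(size=(4, 10))

def augment(image: np.ndarray) -> list[np.ndarray]:
    # Simple label-preserving views: identity, horizontal flip, vertical flip.
    return [image, image[:, ::-1], image[::-1, :]]

def ttaug_predict(image: np.ndarray) -> np.ndarray:
    # Token-level aggregation: average logits over the augmented views,
    # then decode one token id per position from the consensus distribution.
    logits = np.stack([model_logits(view) for view in augment(image)])
    return logits.mean(axis=0).argmax(axis=-1)
```

Because aggregation happens on logits rather than on decoded strings, no parameter update or external supervision is involved, matching the training-free framing above.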

📝 Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
Problem

Research questions and friction points this paper is trying to address.

Improving generalization of small vision-language models efficiently
Reducing computational demands of test-time scaling techniques
Enhancing performance without external supervision or parameter tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Augmentation aggregates token-level outputs
Test-Time Adaptation updates parameters using pseudolabels
Leverages internal features without external supervision
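The TTAdapt contribution above can be illustrated with a toy consensus-driven update. This is a hypothetical sketch under stated assumptions: the paper adapts a full VLM, whereas here a linear classifier stands in for the model, augmented feature vectors stand in for augmented inputs, and the agreement check stands in for the consensus-based pseudo-labeling derived from TTAug.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def ttadapt_step(W: np.ndarray, views: list[np.ndarray], lr: float = 0.1) -> np.ndarray:
    """One unsupervised adaptation step on a toy linear classifier.

    `views` are augmented feature vectors of a single test sample; the
    prediction shared by all views serves as the pseudo-label.
    """
    probs = np.stack([softmax(W @ v) for v in views])
    preds = probs.argmax(axis=1)
    # Consistency filter: only adapt when every view agrees on the label,
    # so low-confidence samples leave the parameters untouched.
    if not (preds == preds[0]).all():
        return W
    pseudo = preds[0]
    # Cross-entropy gradient step toward the pseudo-label, per view.
    for v, p in zip(views, probs):
        grad = np.outer(p - np.eye(len(p))[pseudo], v)
        W = W - lr * grad
    return W
```

The update only fires when the augmented views already agree, which is one simple way to keep pseudo-label noise from degrading the model during inference.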