Scaling Test-time Inference for Visual Grounding

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance gap between small and large vision-language models in visual grounding, where smaller models suffer from limited language understanding while larger ones incur high deployment costs and inference latency. To bridge this gap without increasing model parameters, the authors propose Efficient visual Grounding language Models (EGM), a method that scales test-time computation by increasing the number of generated tokens to strengthen language-side reasoning. Instantiated as EGM-Qwen3-VL-8B, the approach achieves 91.4 IoU on RefCOCO with an average latency of only 737 ms, 5.9× faster than Qwen3-VL-235B (90.5 IoU at 4,320 ms) while delivering superior accuracy. It also significantly outperforms existing baselines on a newly introduced, more challenging amodal grounding setting, in which the model must predict both the visible and occluded parts of objects.

📝 Abstract
Visual grounding is an essential capability of Visual Language Models (VLMs) for understanding the real physical world. Previous state-of-the-art grounding VLMs usually have large model sizes, making them heavy to deploy and slow at inference. However, we notice that the sizes of the visual encoders are nearly the same for small and large VLMs; the major difference lies in the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than in visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): a method to scale the test-time computation (#generated tokens). Scaling the test-time computation of a small model is deployment-friendly and yields better end-to-end latency, as the cost of each token is much cheaper than that of directly running a large model. On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average latency of 737 ms (5.9× faster), while Qwen3-VL-235B demands 4,320 ms to achieve 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of objects. Experiments show our method consistently and significantly improves the vanilla and amodal grounding capabilities of small models to be on par with or better than larger models, thereby improving the efficiency of visual grounding.
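The abstract's headline speedup claim follows directly from the two reported latencies. A quick back-of-envelope check (using only the numbers stated in the abstract; no per-token cost breakdown is given in this summary):

```python
# Latency comparison from the paper's RefCOCO results (as reported above).
egm_8b_latency_ms = 737      # EGM-Qwen3-VL-8B, 91.4 IoU
qwen_235b_latency_ms = 4320  # Qwen3-VL-235B, 90.5 IoU

# End-to-end speedup of the small, token-scaled model over the large model.
speedup = qwen_235b_latency_ms / egm_8b_latency_ms
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 5.9x"
```

This matches the 5.9× figure: even though the small model spends extra tokens at test time, each token is cheap enough that total latency stays well below the large model's.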
Problem

Research questions and friction points this paper is trying to address.

visual grounding
visual language models
model scaling
test-time inference
amodal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
visual grounding
efficient inference
amodal grounding
visual language models