🤖 AI Summary
Large vision-language models (LVLMs) struggle to acquire fine-grained visual reasoning capabilities cost-effectively.
Method: This paper proposes a training-free, decoding-time reasoning enhancement method that couples a lightweight visual reasoner to the LVLM during inference. By differencing the models' output distributions and dynamically reweighting them, it steers the LVLM toward self-verification and corrective "slow thinking" without any parameter updates or reinforcement fine-tuning.
Contribution/Results: To our knowledge, this is the first approach to enable training-free transfer of reasoning capabilities from small to large models. It achieves state-of-the-art performance on spatial reasoning, math-oriented visual question answering, and multi-disciplinary benchmarks, matching the performance of full-scale reinforcement fine-tuned (RFT) models while running up to 38× faster than prior decoding-time methods. Moreover, the implementation coordinates multiple LVLMs in parallel without architectural modification.
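The distribution-differencing idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes the combination takes the proxy-tuning form `large + α · (RFT small − base small)` applied to per-step logits; the exact reweighting schedule ProxyThinker uses may differ.

```python
import numpy as np

def proxythinker_logits(large_logits, rft_small_logits, base_small_logits, alpha=1.0):
    # Shift the large model's next-token logits by the difference between
    # an RFT-tuned small reasoner and its untuned base model.
    # (Sketch of the distribution-differencing idea; `alpha` is an assumed knob.)
    return large_logits + alpha * (rft_small_logits - base_small_logits)

def softmax(x):
    # Numerically stable softmax over the vocabulary dimension.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy example over a 4-token vocabulary.
large = np.array([2.0, 1.0, 0.5, 0.1])  # untuned large model
rft   = np.array([1.0, 3.0, 0.2, 0.1])  # small RFT reasoner prefers token 1
base  = np.array([1.0, 1.0, 0.2, 0.1])  # small base model

combined = proxythinker_logits(large, rft, base)
probs = softmax(combined)
next_token = int(np.argmax(probs))  # the reasoner's preference flips the choice to token 1
```

In this toy case the large model alone would pick token 0, but the correction term contributed by the small reasoner pair makes token 1 the most likely continuation, which is the mechanism by which small-model reasoning behavior transfers at decoding time.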
📝 Abstract
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities of large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities of small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks spanning spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.