🤖 AI Summary
Large vision-language models (LVLMs) struggle to acquire fine-grained visual reasoning capabilities cost-effectively.
Method: This paper proposes a training-free, decoding-time reasoning enhancement method that couples a lightweight visual reasoner to the LVLM during inference. By differencing the models' output distributions and dynamically reweighting them, it steers the LVLM toward self-verification and corrective "slow thinking" without any parameter updates or reinforcement fine-tuning.
Contribution/Results: To our knowledge, this is the first approach to enable training-free transfer of reasoning capabilities from small to large models. It achieves state-of-the-art performance on spatial reasoning, math-oriented visual question answering, and multi-disciplinary benchmarks, matching the performance of full-scale reinforcement fine-tuned (RFT) models while running up to 38× faster than prior decoding-time methods. Moreover, the implementation coordinates multiple LVLMs in parallel without architectural modification.
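The distribution-differencing idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes the combination takes the proxy-tuning form `large + α · (RFT small − base small)` applied to per-step logits; the exact reweighting schedule ProxyThinker uses may differ.

```python
import numpy as np

def proxythinker_logits(large_logits, rft_small_logits, base_small_logits, alpha=1.0):
    # Shift the large model's next-token logits by the difference between
    # an RFT-tuned small reasoner and its untuned base model.
    # (Sketch of the distribution-differencing idea; `alpha` is an assumed knob.)
    return large_logits + alpha * (rft_small_logits - base_small_logits)

def softmax(x):
    # Numerically stable softmax over the vocabulary dimension.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy example over a 4-token vocabulary.
large = np.array([2.0, 1.0, 0.5, 0.1])  # untuned large model
rft   = np.array([1.0, 3.0, 0.2, 0.1])  # small RFT reasoner prefers token 1
base  = np.array([1.0, 1.0, 0.2, 0.1])  # small base model

combined = proxythinker_logits(large, rft, base)
probs = softmax(combined)
next_token = int(np.argmax(probs))  # the reasoner's preference flips the choice to token 1
```

In this toy case the large model alone would pick token 0, but the correction term contributed by the small reasoner pair makes token 1 the most likely continuation, which is the mechanism by which small-model reasoning behavior transfers at decoding time.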
📝 Abstract
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities of large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities of small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks spanning spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.