🤖 AI Summary
Existing multimodal reasoning models suffer from redundant self-reflection and excessively long reasoning chains; moreover, mainstream training-free Chain-of-Thought (CoT) compression methods rely on static visual references, limiting adaptability to dynamic visual reasoning tasks. This paper proposes ChainV, a fine-tuning-free framework for dynamic visual prompt enhancement. Its core innovations include: (1) automatic selection of atomic-level visual prompts based on mean attention intensity; (2) adaptive control of reasoning depth via consistency evaluation and a Bernoulli stochastic process; and (3) integration of coarse-grained block filtering, attention-driven fine-grained visual extraction, reliability assessment, and dynamic thought injection. Evaluated on the MathVista benchmark, ChainV achieves a 2.3% accuracy gain, reduces inference latency by 51.4%, and decreases output token count by 24.5%, significantly improving both efficiency and accuracy in vision-dependent mathematical reasoning.
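The first innovation above — picking the atomic visual prompt with the highest mean attention intensity — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention-map layout, the helper name `select_atomic_hint`, and the 14×14 patch grid are all assumptions for the sake of the example.

```python
import numpy as np

def select_atomic_hint(attn_maps, patch_grid=(14, 14)):
    """Pick the image patch with the highest mean attention intensity.

    attn_maps: array of shape (num_heads, num_patches), hypothetically the
    attention weights from the current reasoning token to each image patch.
    Returns the flat patch index and its (row, col) on the patch grid.
    """
    mean_attn = attn_maps.mean(axis=0)   # average intensity over heads
    idx = int(mean_attn.argmax())        # most-attended atomic patch
    row, col = divmod(idx, patch_grid[1])
    return idx, (row, col)

# Toy usage: 4 heads over a 14x14 = 196-patch grid, with patch 30 boosted.
attn = np.full((4, 196), 0.01)
attn[:, 30] = 0.9
print(select_atomic_hint(attn))  # → (30, (2, 2))
```

In the full method this selection is preceded by coarse block filtering, so the argmax runs only over patches surviving that first stage.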
📝 Abstract
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into the thinking process via a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a 2.3% improvement on MathVista with MiMo-VL-RL, while reducing inference latency by 51.4% and shortening output token length by 24.5%.
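The consistency-based reliability check and the Bernoulli injection step described above can be sketched together. This is a hedged illustration only: the agreement-over-recent-steps reliability estimate, the function name `should_inject_hint`, and the probability floor are assumptions standing in for the paper's actual formulation.

```python
import random

def should_inject_hint(hint_history, window=3, p_floor=0.1, rng=random):
    """Decide whether to inject the current visual hint into the thought.

    hint_history: list of hint identifiers (e.g. patch indices) chosen at
    recent reasoning steps. Reliability is approximated here as how often
    the latest hint agrees with hints in the recent window; injection is
    then a Bernoulli draw with that reliability as the success probability.
    Returns (inject: bool, reliability: float).
    """
    recent = hint_history[-window:]
    if not recent:
        return False, 0.0
    reliability = recent.count(recent[-1]) / len(recent)  # consistency score
    p = max(p_floor, reliability)                         # never fully mute
    return rng.random() < p, reliability

# Toy usage: the same patch (30) was selected in 2 of the last 3 steps,
# so reliability is 2/3 and the hint is injected with probability 2/3.
inject, rel = should_inject_hint([12, 30, 7, 30], rng=random.Random(0))
print(rel)  # → 0.6666...
```

When a hint is injected, its pixel coordinates and reliability score would be serialized into the ongoing chain of thought; a low reliability score signals the model to retain more self-reflection rather than trusting the hint outright.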