🤖 AI Summary
Existing multimodal reasoning approaches often lose fine-grained visual details or drift from the visual evidence, largely because of when and how that evidence is introduced into reasoning. To address this limitation, this work proposes the CSMR framework, built around a cognitive scheduling mechanism: a language model controls the reasoning process and invokes an independent visual perception module on demand to acquire task-relevant visual evidence. This design avoids the static visual-to-text conversion of one prevailing paradigm and the linguistic dominance that joint optimization induces in the other. Under zero-shot settings, CSMR consistently outperforms representative baselines in accuracy across multiple multimodal reasoning benchmarks, and further analysis attributes these gains primarily to the scheduling mechanism.
📝 Abstract
Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.
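The control flow described above, in which a language model decides at each step whether to request visual evidence or to answer, can be illustrated with a minimal sketch. This is not the CSMR implementation: the names `ReasoningState`, `scheduled_reasoning`, `toy_controller`, and `toy_perceiver`, the `("look", query)` / `("answer", text)` action format, and the `max_steps` cap are all assumptions made for illustration, and the two toy functions stand in for a real language model and a real perception module.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ReasoningState:
    """Question plus the visual evidence gathered so far (hypothetical structure)."""
    question: str
    evidence: List[str] = field(default_factory=list)


# Assumed interfaces: the controller returns ("look", query) to request
# visual evidence, or ("answer", text) to finish reasoning.
Controller = Callable[[ReasoningState], Tuple[str, str]]
Perceiver = Callable[[str], str]


def scheduled_reasoning(
    question: str,
    controller: Controller,
    perceive: Perceiver,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Let the controller decide, step by step, when to invoke perception."""
    state = ReasoningState(question)
    for _ in range(max_steps):
        action, payload = controller(state)
        if action == "answer":
            return payload, state.evidence
        # Any other action is treated as a perception request for `payload`.
        state.evidence.append(perceive(payload))
    # Step budget exhausted without an answer.
    return None, state.evidence


def toy_controller(state: ReasoningState) -> Tuple[str, str]:
    """Stand-in for a language model: look once, then answer."""
    if not state.evidence:
        return ("look", "color of the car")
    return ("answer", f"Based on '{state.evidence[-1]}', the car is red.")


def toy_perceiver(query: str) -> str:
    """Stand-in for an independent visual perception module."""
    return f"[{query}] -> red"


answer, evidence = scheduled_reasoning(
    "What color is the car?", toy_controller, toy_perceiver
)
print(answer)
print(evidence)
```

In this sketch the perception module is never fused into the controller: it is called only when the controller asks for it, and only with a query the controller chose. That separation is the property the abstract contrasts with static visual-to-text conversion and with end-to-end joint reasoning.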