Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from high upgrade costs and poor scalability due to the tight coupling between perception and reasoning. Method: This paper proposes a novel “reasoning-aligned perceptual decoupling” paradigm: visual inputs are first converted into task-oriented linguistic descriptions by a reward-optimized captioner, then processed by high-performance, text-only reasoning models. We introduce the first reinforcement learning–based closed-loop caption optimization framework, integrating reward modeling, multimodal alignment distillation, and a plug-and-play captioner architecture to ensure both the visual fidelity and the reasoning sufficiency of generated captions. Results: Our approach achieves state-of-the-art average performance on multimodal mathematics and science benchmarks. It enables zero-shot, zero-fine-tuning integration with next-generation reasoning LLMs, entirely eliminating the need for end-to-end multimodal retraining.

📝 Abstract
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.
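The decoupling the abstract describes reduces to a two-stage pipeline: a visual extractor produces a task-oriented caption, and a text-only reasoner answers from that caption alone. The sketch below illustrates this flow only; `captioner` and `reasoner` are hypothetical stand-ins (not names from the paper's code), and the prompt template is an assumption.

```python
def decoupled_answer(image, question, captioner, reasoner):
    """Answer a visual question via perceptual decoupling.

    The reasoner never sees pixels, only text, so it can be swapped
    for a stronger text-only LLM without re-aligning the vision side.
    """
    # Stage 1 (perception): convert the image into a question-aware caption.
    caption = captioner(image, question)
    # Stage 2 (reasoning): a text-only LLM answers from the caption alone.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer step by step."
    )
    return reasoner(prompt)
```

Because the interface between the two stages is plain text, upgrading the reasoner is a drop-in change, which is the scalability claim the paper makes.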
Problem

Research questions and friction points this paper is trying to address.

Decoupling perception from reasoning in multi-modal models
Optimizing visual captions for accurate downstream reasoning
Enhancing scalability without costly multi-modal re-alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples perception from reasoning via captioning
Uses reinforcement learning for caption reward optimization
Enhances visual grounding without retraining alignments
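The caption-reward idea above can be sketched as a simple credit-assignment loop: sample several captions for an image, let a frozen text-only reasoner answer from each, and reward the captions whose answers match the ground truth. A group-normalized advantage (in the style of GRPO) then drives a policy-gradient update of the captioner. This is a minimal illustration under those assumptions, not the paper's exact reward shaping; `reasoner` is a hypothetical stand-in.

```python
def caption_advantages(captions, question, answer, reasoner):
    """Score sampled captions by downstream reasoning correctness.

    Returns one advantage per caption: positive for captions that let
    the reasoner reach the right answer, negative otherwise.
    """
    # Reward 1.0 if reasoning from this caption yields the ground truth.
    rewards = [1.0 if reasoner(c, question) == answer else 0.0
               for c in captions]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # All captions tied (all right or all wrong): no learning signal.
    if std == 0:
        return [0.0] * len(rewards)
    # Group-normalized advantages for a policy-gradient update.
    return [(r - mean) / std for r in rewards]
```

The key property is that the reward comes from the reasoner's success, not from caption similarity to a reference, which is what aligns captioning with the reasoning objective.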
👥 Authors

Yunhao Gou
Southern University of Science and Technology; The Hong Kong University of Science and Technology

Kai Chen
The Hong Kong University of Science and Technology

Zhili Liu
Beike
SLAM, DL, HPC, Computer Graphics

Lanqing Hong
Huawei Noah's Ark Lab

Xin Jin
Huawei Cloud

Zhenguo Li
Huawei Noah's Ark Lab; Columbia; CUHK; PKU
machine learning, generative AI, AI for mathematics

James T. Kwok
Professor of Computer Science and Engineering, Hong Kong University of Science and Technology
Machine learning

Yu Zhang
Southern University of Science and Technology