Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from high upgrade costs and poor scalability due to the tight coupling between perception and reasoning. Method: This paper proposes a novel “reasoning-aligned perceptual decoupling” paradigm: visual inputs are first converted into task-oriented linguistic descriptions by a reward-optimized captioner, then processed by high-performance, text-only reasoning models. We introduce the first reinforcement learning–based closed-loop caption optimization framework, integrating reward modeling, multimodal alignment distillation, and a plug-and-play captioner architecture to ensure both the visual fidelity and the reasoning sufficiency of generated captions. Results: Our approach achieves state-of-the-art average performance on multimodal mathematics and science benchmarks. It enables zero-shot, zero-fine-tuning integration with next-generation reasoning LLMs, entirely eliminating the need for end-to-end multimodal retraining.

📝 Abstract
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.
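The decoupling the abstract describes reduces to a two-stage pipeline: a visual extractor produces a task-oriented caption, and a text-only reasoner answers from that caption alone. The sketch below illustrates this flow only; `captioner` and `reasoner` are hypothetical stand-ins (not names from the paper's code), and the prompt template is an assumption.

```python
def decoupled_answer(image, question, captioner, reasoner):
    """Answer a visual question via perceptual decoupling.

    The reasoner never sees pixels, only text, so it can be swapped
    for a stronger text-only LLM without re-aligning the vision side.
    """
    # Stage 1 (perception): convert the image into a question-aware caption.
    caption = captioner(image, question)
    # Stage 2 (reasoning): a text-only LLM answers from the caption alone.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer step by step."
    )
    return reasoner(prompt)
```

Because the interface between the two stages is plain text, upgrading the reasoner is a drop-in change, which is the scalability claim the paper makes.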
Problem

Research questions and friction points this paper is trying to address.

Decoupling perception from reasoning in multi-modal models
Optimizing visual captions for accurate downstream reasoning
Enhancing scalability without costly multi-modal re-alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples perception from reasoning via captioning
Uses reinforcement learning for caption reward optimization
Enhances visual grounding without retraining alignments
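The caption-reward idea above can be sketched as a simple credit-assignment loop: sample several captions for an image, let a frozen text-only reasoner answer from each, and reward the captions whose answers match the ground truth. A group-normalized advantage (in the style of GRPO) then drives a policy-gradient update of the captioner. This is a minimal illustration under those assumptions, not the paper's exact reward shaping; `reasoner` is a hypothetical stand-in.

```python
def caption_advantages(captions, question, answer, reasoner):
    """Score sampled captions by downstream reasoning correctness.

    Returns one advantage per caption: positive for captions that let
    the reasoner reach the right answer, negative otherwise.
    """
    # Reward 1.0 if reasoning from this caption yields the ground truth.
    rewards = [1.0 if reasoner(c, question) == answer else 0.0
               for c in captions]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # All captions tied (all right or all wrong): no learning signal.
    if std == 0:
        return [0.0] * len(rewards)
    # Group-normalized advantages for a policy-gradient update.
    return [(r - mean) / std for r in rewards]
```

The key property is that the reward comes from the reasoner's success, not from caption similarity to a reference, which is what aligns captioning with the reasoning objective.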
👥 Authors

Yunhao Gou
Southern University of Science and Technology; The Hong Kong University of Science and Technology

Kai Chen
The Hong Kong University of Science and Technology

Zhili Liu
Beike
SLAM, DL, HPC, Computer Graphics

Lanqing Hong
Huawei Noah's Ark Lab

Xin Jin
Huawei Cloud

Zhenguo Li
Huawei Noah's Ark Lab; Columbia; CUHK; PKU
machine learning, generative AI, AI for mathematics

James T. Kwok
Professor of Computer Science and Engineering, Hong Kong University of Science and Technology
Machine learning

Yu Zhang
Southern University of Science and Technology