🤖 AI Summary
To address two critical bottlenecks in multimodal large language models—insufficient global contextual understanding and modality shortcuts (i.e., overreliance on linguistic priors while neglecting visual or textual cues)—this paper proposes a context-aware reinforcement learning framework. Methodologically, it introduces a joint optimization mechanism combining context consistency reward and logical reasoning reward; constructs IntentBench, the first benchmark dedicated to complex human intent and emotion reasoning across modalities; and employs an LLM-driven multi-dimensional reward signal (contextual fidelity, formatting compliance, accuracy, and logical coherence) with PPO-based multimodal RL fine-tuning. Additionally, it designs an omni-modal encoder-reasoner architecture enhanced with contextual grounding. Evaluated on multiple omni-modal benchmarks, the approach significantly outperforms open-source baselines: reducing context errors by 37%, mitigating modality shortcuts by 52%, and improving logical coherence by 41%.
📝 Abstract
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.