Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

📅 2026-05-26
🤖 AI Summary
This work addresses modality collapse in multimodal large language models during reinforcement learning fine-tuning: applying uniform policy gradient updates across all tokens often leads to over-reliance on textual priors and neglect of the audio signal in long chain-of-thought reasoning, which results in hallucinations. To mitigate this, the paper proposes Modality-Aware Policy Optimization (MAPO), a dual-branch reinforcement learning framework. The first branch dynamically identifies modality-critical tokens with a modality relevance mask, derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy, and concentrates the policy gradient on those tokens. The second branch is an auxiliary attention loss that applies a temporally scaled penalty to the model's attention distributions, sustaining focus on the audio input deep into the reasoning trace without requiring domain-specific priors. MAPO improves long-horizon reasoning fidelity and multimodal instruction following, and sets new state-of-the-art results among open-weight models on several audio reasoning benchmarks.
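The masking idea in the summary can be illustrated with a small sketch. This is not the paper's implementation: the summary does not give the exact formula, so the absolute entropy gap, the `threshold` value, the function names, and the simplified surrogate loss (no clipping or KL term) are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def modality_relevance_mask(policy_dists, ablated_dists, threshold=0.1):
    """Flag tokens whose predictive entropy shifts when audio is removed.

    policy_dists:  per-token distributions from the multimodal policy.
    ablated_dists: per-token distributions from an audio-ablated reference.
    The absolute-gap rule and the threshold are assumptions, not the
    paper's definition.
    """
    mask = []
    for with_audio, without_audio in zip(policy_dists, ablated_dists):
        gap = abs(entropy(without_audio) - entropy(with_audio))
        mask.append(1.0 if gap > threshold else 0.0)
    return mask


def masked_policy_loss(logprobs, advantage, mask):
    """Policy-gradient surrogate averaged over masked tokens only.

    Simplified on purpose: one sequence-level advantage, no clipping.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    weighted = sum(m * advantage * lp for m, lp in zip(mask, logprobs))
    return -weighted / kept


# Toy example: token 0 depends on audio, token 1 does not.
policy = [[0.90, 0.05, 0.05], [0.4, 0.3, 0.3]]
ablated = [[0.34, 0.33, 0.33], [0.4, 0.3, 0.3]]
mask = modality_relevance_mask(policy, ablated)
loss = masked_policy_loss([-0.1, -1.0], advantage=2.0, mask=mask)
```

In the toy example only the first token contributes to the loss, because removing the audio changes its predictive entropy while the second token's distribution is unchanged.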
๐Ÿ“ Abstract
Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
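As a rough illustration of the temporally scaled attention penalty described in the abstract, the sketch below penalizes generated tokens whose attention mass on audio positions falls under a floor, and weights later tokens more heavily. The abstract does not specify the penalty's form, so the linear time weight, the `floor` value, the hinge shape, and all function names are hypothetical choices.

```python
def audio_attention_mass(attn_row, audio_positions):
    """Share of one token's attention weights that lands on audio positions."""
    return sum(attn_row[i] for i in audio_positions)


def temporal_attention_penalty(attn_rows, audio_positions, floor=0.2):
    """Weighted mean shortfall of audio attention below `floor`.

    attn_rows: one attention distribution per generated token (already
    averaged over heads and layers, which is an assumption here).
    Later tokens get a larger weight, (t + 1) / T, so that drifting away
    from the audio late in the reasoning trace costs more.
    """
    steps = len(attn_rows)
    if steps == 0:
        return 0.0
    weighted, norm = 0.0, 0.0
    for t, row in enumerate(attn_rows):
        weight = (t + 1) / steps  # assumed linear temporal scaling
        mass = audio_attention_mass(row, audio_positions)
        weighted += weight * max(0.0, floor - mass)
        norm += weight
    return weighted / norm


# Toy trace: positions 0 and 1 are audio, positions 2 and 3 are text.
audio_positions = [0, 1]
late_collapse = [
    [0.30, 0.30, 0.20, 0.20],  # grounded in audio
    [0.10, 0.10, 0.40, 0.40],  # exactly at the floor
    [0.00, 0.05, 0.50, 0.45],  # audio mostly ignored
]
early_collapse = list(reversed(late_collapse))

late_penalty = temporal_attention_penalty(late_collapse, audio_positions)
early_penalty = temporal_attention_penalty(early_collapse, audio_positions)
```

With these toy values the same lapse in audio attention is penalized more when it occurs at the end of the trace than at the start, which is the behavior the temporal scaling is meant to produce.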
Problem

Research questions and friction points this paper is trying to address.

modality collapse
audio reasoning
reinforcement learning
cross-modal grounding
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Aware Policy Optimization
late-stage modality collapse
cross-modal grounding
reinforcement learning
audio reasoning