🤖 AI Summary
Existing large audio language models rely on one-pass audio encoding, which limits their capacity for deep understanding of complex audio content. This work proposes an audio-interleaved reasoning mechanism that treats audio as an active reasoning component, dynamically re-listening to critical segments during generation to overcome the constraints of conventional encoding paradigms. The approach employs a two-stage training framework: supervised fine-tuning first teaches the model to identify salient audio regions, and reinforcement learning then optimizes its re-listening strategy. A structured data generation pipeline is further developed to support efficient training. Extensive evaluations on both expert-level and general-purpose audio understanding benchmarks demonstrate state-of-the-art performance, confirming the effectiveness and generalization capability of the proposed paradigm.
📝 Abstract
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, an LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.