🤖 AI Summary
This work addresses the limitations of current large language model (LLM)-based speech recognition systems, which typically process utterances in isolation and struggle to leverage conversational context effectively. The raw audio context grows rapidly with dialogue length, leading to prohibitive computational costs. To overcome this, the authors propose a multimodal dialogue context modeling approach that integrates historical audio and textual transcripts through multi-turn supervised training. Central to their method is an “abstract compression” mechanism, which replaces lengthy historical audio with a fixed number of learnable latent tokens, substantially reducing the context representation size while preserving essential contextual information. Experiments demonstrate that the proposed method recovers much of the recognition gain afforded by full historical context while using significantly less contextual memory, achieving consistent improvements on both in-domain and out-of-domain test sets—particularly for context-dependent entity recognition.
📝 Abstract
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
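The paper does not include an implementation, but the core idea of Abstract Compression—mapping a variable-length prior-turn audio token sequence onto a fixed number of learned latent tokens—can be illustrated with cross-attention pooling, as used in Perceiver-style resamplers. The sketch below is an assumption about the mechanism, not the authors' code; all names, shapes, and the use of a single attention layer are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_audio_context(audio_tokens, latents, Wq, Wk, Wv):
    """Cross-attention pooling: a fixed set of learned latent tokens
    attends over the variable-length prior-turn audio embeddings,
    yielding a fixed-size summary regardless of how long the
    conversation history grows."""
    q = latents @ Wq              # (n_latent, d) queries from latents
    k = audio_tokens @ Wk         # (T, d) keys from audio history
    v = audio_tokens @ Wv         # (T, d) values from audio history
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_latent, T)
    return attn @ v               # (n_latent, d): fixed-size output

# Illustrative dimensions (assumed, not from the paper)
d, n_latent = 64, 8
latents = rng.standard_normal((n_latent, d))   # learned in training
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

short_history = rng.standard_normal((120, d))   # a short dialogue history
long_history = rng.standard_normal((2400, d))   # a 20x longer history

out_short = compress_audio_context(short_history, latents, Wq, Wk, Wv)
out_long = compress_audio_context(long_history, latents, Wq, Wk, Wv)
```

Both calls produce an `(8, 64)` summary, so the prior-turn audio footprint stays constant as the dialogue lengthens, while the per-turn transcripts are kept explicitly as text.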