Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of current large language model (LLM)-based speech recognition systems, which typically process utterances in isolation and struggle to leverage conversational context effectively. The raw audio context grows rapidly with dialogue length, leading to prohibitive computational costs. To overcome this, the authors propose a multimodal dialogue context modeling approach that integrates historical audio and textual transcripts through multi-turn supervised training. Central to their method is an "abstract compression" mechanism, which replaces lengthy historical audio with a fixed number of learnable latent tokens, substantially reducing representation size while preserving essential contextual information. Experiments show that the proposed method recovers much of the recognition gains afforded by full historical context while using significantly less contextual memory, with improvements on both in-domain and out-of-domain test sets, particularly for context-dependent entity recognition.
📝 Abstract
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
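The paper does not spell out the compression architecture in this abstract. As a minimal sketch of the core idea, assuming a cross-attention pooling design in which a fixed number of learned latent query tokens attend over the variable-length prior-turn audio features (all names and shapes here are hypothetical illustrations, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def abstract_compress(audio_feats, latents, Wq, Wk, Wv):
    """Cross-attention pooling: K learned latent tokens attend over
    T prior-turn audio frames, producing a fixed-size (K, d) summary
    regardless of how long the conversation history grows."""
    Q = latents @ Wq            # (K, d) queries from learned latents
    K_ = audio_feats @ Wk       # (T, d) keys from audio features
    V = audio_feats @ Wv        # (T, d) values from audio features
    attn = softmax(Q @ K_.T / np.sqrt(Q.shape[-1]), axis=-1)  # (K, T)
    return attn @ V             # (K, d) compressed audio context

rng = np.random.default_rng(0)
d, K = 16, 4                    # hidden dim, latent-token budget
latents = rng.standard_normal((K, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Whether the prior-turn audio spans 300 or 3000 frames,
# the compressed representation stays at K tokens.
short = abstract_compress(rng.standard_normal((300, d)), latents, Wq, Wk, Wv)
long = abstract_compress(rng.standard_normal((3000, d)), latents, Wq, Wk, Wv)
print(short.shape, long.shape)  # (4, 16) (4, 16)
```

The fixed latent budget is what keeps the prior-turn audio footprint constant as dialogue length grows; per the abstract, the corresponding transcripts are still passed to the LLM explicitly rather than compressed.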
Problem

Research questions and friction points this paper is trying to address.

conversational context
LLM-based ASR
audio compression
contextual entity recognition
multimodal context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Abstract Compression
LLM-based ASR
conversational context
latent tokens
multi-turn speech recognition