Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

📅 2026-05-26
🤖 AI Summary
Current large audio language models (LALMs) struggle to retain non-speech acoustic information across multi-turn interactions. They show a significant gap between semantic and acoustic understanding, and the internal representations and memory mechanisms behind that gap remain poorly understood. This work introduces EnvMem, a controlled multi-turn benchmark that investigates representational drift and attention allocation in acoustic memory through controlled testing, structural probing of representations, and dynamic attention analysis. The study finds that trajectory drift in internal representations is the primary cause of acoustic memory degradation, whereas attention allocation plays only a limited role. Building on these findings, the authors provide a systematic framework for analyzing acoustic memory in audio language models, intended to inform future data curation and training strategies.
📝 Abstract
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
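To make the two levels named in the abstract concrete, here is a minimal sketch of one way each could be quantified. It is not the paper's method: the function names, the choice of cosine distance as a drift measure, the attention-share ratio, and the toy numbers are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def trajectory_drift(turn_embeddings):
    """Cosine distance of each turn's embedding from the first turn's.

    Assumption: `turn_embeddings` holds one latent vector per dialogue turn
    for the same acoustic event (hypothetical input, not from the paper).
    """
    reference = turn_embeddings[0]
    return [1.0 - cosine_similarity(reference, e) for e in turn_embeddings]


def audio_attention_share(attention_row, audio_positions):
    """Fraction of one query's attention mass that lands on audio tokens.

    Assumption: `attention_row` is a single row of non-negative attention
    weights and `audio_positions` indexes the audio tokens in the context.
    """
    total = sum(attention_row)
    return sum(attention_row[i] for i in audio_positions) / total


# Toy data (made up): the embedding rotates away from its turn-0 direction.
turns = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
drift = trajectory_drift(turns)

# Toy attention row over four context tokens; the first two are audio tokens.
share = audio_attention_share([0.1, 0.2, 0.3, 0.4], [0, 1])

print("drift per turn:", drift)
print("attention share on audio tokens:", share)
```

Under this toy setup, a drift value that grows with the turn index would correspond to a representation-level failure, while a stable attention share on audio tokens would be consistent with retrieval playing a limited role.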
Problem

Research questions and friction points this paper is trying to address.

acoustic memory
multi-turn interaction
representation bottleneck
retrieval bottleneck
non-speech audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

acoustic memory
representation drift
multi-turn benchmark
audio language models
attention dynamics