🤖 AI Summary
This work investigates the contextual memory and utilization capabilities of open-source multimodal conversational speech models, revealing a significant deficit compared to their text-only counterparts. To address this gap, the authors introduce ContextDialog, the first dedicated benchmark for evaluating context retention in speech-based dialogue, and empirically identify a "speech-modality-induced context forgetting" phenomenon. They systematically assess mitigation strategies, including joint ASR modeling, retrieval-augmented generation (RAG), and consistency-aware evaluation. Results show that mainstream open-source speech models exhibit 32.7% lower recall accuracy for information mentioned in prior speech turns than text models, and that even with RAG, cross-turn question-answering F1 remains below 41%. This study establishes a novel evaluation paradigm for context memory in spoken dialogue and demonstrates that inherent modality-specific characteristics fundamentally constrain memory performance.
📝 Abstract
Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we propose for this purpose. Our findings show that speech-based models struggle more than text-based ones, especially when recalling information conveyed in speech, and that even with retrieval-augmented generation, models still falter on questions about past utterances. These insights highlight key limitations of open-source models and suggest directions for improving memory retention and retrieval robustness.