Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models

📅 2025-02-27
🤖 AI Summary
This work investigates the contextual memory and utilization capabilities of open-source multimodal conversational speech models, revealing a significant deficit compared to their text-only counterparts. To address this gap, the authors introduce ContextDialog—the first dedicated benchmark for evaluating context retention in speech-based dialogue—and empirically identify the “speech-modality-induced context forgetting” phenomenon. They systematically assess mitigation strategies including ASR joint modeling, retrieval-augmented generation (RAG), and consistency-aware evaluation. Results show that mainstream open-source speech models exhibit a 32.7% lower recall accuracy for information mentioned in prior speech turns than text models; even with RAG, cross-turn question-answering F1 remains below 41%. This study establishes a novel evaluation paradigm for context memory in spoken dialogue and demonstrates that inherent modality-specific characteristics fundamentally constrain memory performance.

📝 Abstract
Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we propose for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech. Even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.
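To make the retrieval-augmented setting in the abstract concrete, the sketch below shows one simple way past utterances could be retrieved and placed in front of a question. It is an illustration only and is not the authors' pipeline: the function names (`tokenize`, `retrieve_past_utterances`, `build_prompt`), the word-overlap scoring, and the prompt layout are all assumptions, and the example dialogue is invented rather than taken from ContextDialog.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def retrieve_past_utterances(history, question, top_k=2):
    """Toy retriever (assumption): rank earlier turns by word overlap with the question."""
    question_counts = Counter(tokenize(question))
    scored = []
    for idx, utterance in enumerate(history):
        # Multiset intersection counts the words shared with the question.
        overlap = sum((question_counts & Counter(tokenize(utterance))).values())
        scored.append((overlap, idx))
    # Highest overlap first; ties go to the more recent turn.
    scored.sort(key=lambda pair: (-pair[0], -pair[1]))
    best = [idx for overlap, idx in scored[:top_k] if overlap > 0]
    # Return the selected turns in their original dialogue order.
    return [history[i] for i in sorted(best)]


def build_prompt(history, question, top_k=2):
    """Hypothetical prompt layout: retrieved turns first, then the question."""
    retrieved = retrieve_past_utterances(history, question, top_k)
    context = "\n".join(f"- {u}" for u in retrieved)
    return f"Relevant earlier turns:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Invented example dialogue, not drawn from the benchmark.
    history = [
        "User: I booked a flight to Lisbon for Friday.",
        "Assistant: Great, do you need a hotel?",
        "User: Yes, something near the river.",
    ]
    print(build_prompt(history, "Which city did I book a flight to?", top_k=1))
```

A real system would more likely retrieve with dense embeddings and, for spoken dialogue, would first have to transcribe or encode the audio. Both steps can introduce errors, which is consistent with the abstract's observation that retrieval alone does not resolve questions about past utterances.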
Problem

Research questions and friction points this paper is trying to address.

Evaluates open-source voice interaction models' context recall
Compares speech-based and text-based models' memory utilization
Identifies limitations in retrieval-augmented generation for past utterances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates open-source voice models
Introduces ContextDialog benchmark
Compares speech and text model performance