Aligning Spoken Dialogue Models from User Interactions

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing preference learning methods are designed for text-based language models and struggle to capture the dynamic interactions inherent in real-time spoken dialogue—such as implicit turn-taking, frequent interruptions, and overlapping speech. This work introduces the first preference alignment framework tailored specifically for spoken dialogue. We construct a large-scale, multi-turn spoken preference dataset comprising over 150,000 utterance pairs and pioneer the application of offline alignment techniques—including DPO and RLHF variants—to fine-tune full-duplex speech-to-speech models. Our approach integrates AI-generated feedback for automatic annotation, autoregressive speech modeling, and a multi-dimensional human evaluation protocol. Experiments demonstrate significant improvements in factual consistency, safety, and contextual coherence. End-to-end human evaluation confirms strong generalization across diverse conversational dynamics and establishes foundational principles for balancing multiple interactional modalities—e.g., timing, prosody, and turn management—in spoken dialogue systems.
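
The summary names DPO among the offline alignment techniques applied here. For reference, below is a minimal sketch of the standard DPO objective over a batch of preference pairs; the function signature and tensor names are illustrative assumptions rather than the paper's implementation, and the summed token log-probabilities are assumed to be taken over the model's generated speech units.

```python
# Minimal sketch of a DPO-style preference loss (one of the offline alignment
# objectives named above). Tensor names and shapes are illustrative; log-probs
# would come from the policy and a frozen reference copy of the
# speech-to-speech model, summed over the chosen / rejected response tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    preferred ("chosen") or dispreferred ("rejected") continuation.
    """
    # Log-ratio of policy to reference for each side of the pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between chosen and rejected.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```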

📝 Abstract
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
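
The abstract describes preference pairs drawn from raw multi-turn conversations, annotated with AI feedback, and covering both linguistic content and temporal context variations. The dataclass below is a hypothetical sketch of what one such record could contain; every field name is invented for illustration and does not reflect the paper's actual schema.

```python
# Hypothetical shape of one entry in a spoken preference dataset of the kind
# described above. All field names are invented for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class SpokenPreferencePair:
    """One multi-turn context with a preferred and a dispreferred continuation."""
    context_audio_tokens: List[int]         # discrete units for the dialogue so far
    chosen_response_tokens: List[int]       # preferred continuation (content and timing)
    rejected_response_tokens: List[int]     # dispreferred continuation
    annotation_source: str = "ai_feedback"  # pairs are labeled with AI feedback
    preference_axis: str = "content"        # e.g. "content" vs. "temporal" variation
```
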
Problem

Research questions and friction points this paper is trying to address.

Aligning spoken dialogue models with real-time user interactions
Addressing complexities of speech dynamics like interruptions and interjections
Improving model performance on linguistic content and temporal context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference alignment framework for spoken dialogue
Large-scale dataset with AI feedback annotations
Offline alignment for full-duplex speech model