The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the extent to which self-supervised speech representations rely on prosodic and lexical cues for turn-taking modeling in conversational dialogue. Leveraging vocoder-based techniques, the authors disentangle these two cue types more cleanly than prior work, generating speech that preserves the original prosody while rendering the lexical content unintelligible. Through probing experiments with established self-supervised models such as CPC and wav2vec 2.0, they demonstrate that either prosody or lexical information alone suffices to attain turn-taking prediction performance nearly matching that on the original speech. Moreover, when one cue type is degraded, the model compensates by exploiting the other without retraining. These findings indicate that self-supervised representations encode prosodic and lexical information with limited interdependence yet strong complementarity in turn-taking prediction.

📝 Abstract
Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues, or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction accuracy on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals that both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, offering privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent for CPC-based and wav2vec 2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.
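The abstract's central manipulation, replacing speech with noise that retains the original prosody, can be illustrated with a toy sketch. The snippet below shapes white noise with the frame-level energy envelope of the input signal (one prosodic cue), making lexical content unintelligible while amplitude dynamics survive. This is a simplified illustration using only NumPy, not the paper's vocoder-based method, and the function name and frame parameters are illustrative assumptions.

```python
import numpy as np

def prosody_matched_noise(speech, frame_len=400, hop=160, seed=0):
    """Toy sketch: replace speech with white noise modulated by the
    original frame-level RMS energy envelope. Preserves amplitude
    prosody while destroying lexical intelligibility. NOT the paper's
    vocoder pipeline, which also preserves pitch and other cues."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    out = np.zeros(len(speech))
    norm = np.zeros(len(speech))
    window = np.hanning(frame_len)
    for start in range(0, len(speech) - frame_len + 1, hop):
        frame = speech[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))  # frame energy (prosodic cue)
        out[start:start + frame_len] += window * noise[start:start + frame_len] * rms
        norm[start:start + frame_len] += window  # overlap-add normalisation
    return out / np.maximum(norm, 1e-8)
```

A full reproduction of the paper's manipulation would additionally carry over pitch and voicing through a vocoder; this sketch only conveys the idea of cue-selective resynthesis.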
Problem

Research questions and friction points this paper is trying to address.

turn-taking
prosodic cues
lexical cues
self-supervised speech representations
human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised speech representations
turn-taking
prosodic cues
lexical cues
vocoder-based manipulation
Sam O’Connor Russell
School of Engineering, Trinity College Dublin, Ireland
Delphine Charuau
School of Engineering, Trinity College Dublin, Ireland
Naomi Harte
Professor in Speech Technology, Trinity College Dublin
Audio-visual speech recognition · speech quality · multimodal interaction · birdsong analysis