The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the extent to which self-supervised speech representations rely on prosodic and lexical cues for turn-taking modeling in conversational dialogue. Leveraging vocoder-based techniques, the authors disentangle these two cue types more cleanly than prior work, generating speech that preserves the original prosody while rendering the lexical content unintelligible. Through probing experiments with established self-supervised models such as CPC and wav2vec 2.0, they demonstrate that either prosody or lexical information alone suffices to attain turn-taking prediction performance nearly matching that on the original speech. Moreover, when one cue type is degraded, the model compensates by exploiting the other without retraining. These findings indicate that self-supervised representations encode prosodic and lexical information with limited interdependence yet strong complementarity in turn-taking prediction.

📝 Abstract
Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues, or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction accuracy on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals that both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, offering privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent for CPC-based and wav2vec 2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.
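The abstract's central manipulation, replacing speech with noise that retains the original prosody, can be illustrated with a toy sketch. The snippet below shapes white noise with the frame-level energy envelope of the input signal (one prosodic cue), making lexical content unintelligible while amplitude dynamics survive. This is a simplified illustration using only NumPy, not the paper's vocoder-based method, and the function name and frame parameters are illustrative assumptions.

```python
import numpy as np

def prosody_matched_noise(speech, frame_len=400, hop=160, seed=0):
    """Toy sketch: replace speech with white noise modulated by the
    original frame-level RMS energy envelope. Preserves amplitude
    prosody while destroying lexical intelligibility. NOT the paper's
    vocoder pipeline, which also preserves pitch and other cues."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    out = np.zeros(len(speech))
    norm = np.zeros(len(speech))
    window = np.hanning(frame_len)
    for start in range(0, len(speech) - frame_len + 1, hop):
        frame = speech[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))  # frame energy (prosodic cue)
        out[start:start + frame_len] += window * noise[start:start + frame_len] * rms
        norm[start:start + frame_len] += window  # overlap-add normalisation
    return out / np.maximum(norm, 1e-8)
```

A full reproduction of the paper's manipulation would additionally carry over pitch and voicing through a vocoder; this sketch only conveys the idea of cue-selective resynthesis.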
Problem

Research questions and friction points this paper is trying to address.

turn-taking
prosodic cues
lexical cues
self-supervised speech representations
human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised speech representations
turn-taking
prosodic cues
lexical cues
vocoder-based manipulation
Sam O’Connor Russell
School of Engineering, Trinity College Dublin, Ireland
Delphine Charuau
School of Engineering, Trinity College Dublin, Ireland
Naomi Harte
Professor in Speech Technology, Trinity College Dublin
Audio-visual speech recognition · speech quality · multimodal interaction · birdsong analysis