🤖 AI Summary
This study investigates how far self-supervised speech representations rely on prosodic versus lexical cues for turn-taking modeling in conversational dialogue. Using a vocoder-based technique, the authors disentangle these two cue types more cleanly than prior work, generating speech that preserves the original prosody while rendering the lexical content unintelligible. In probing experiments with established self-supervised models such as CPC and wav2vec 2.0, they show that either prosody or lexical information alone suffices to reach turn-taking prediction performance close to that obtained on the original speech. Moreover, when one cue type is degraded, the model compensates by exploiting the other without any retraining. These findings indicate that self-supervised representations encode prosodic and lexical information with limited interdependence yet strong complementarity for turn-taking prediction.
📝 Abstract
Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues, or both. We introduce a vocoder-based approach to control prosodic and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction accuracy on prosody-matched, unintelligible noise is similar to that on clean speech. This reveals that both prosodic and lexical cues support turn-taking, and that either can be used in isolation. Hence, future models may only require prosody, offering privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating that the two are encoded in S3Rs with limited interdependence. Results are consistent across CPC-based and wav2vec 2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.
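The core idea of prosody-matched, unintelligible stimuli can be illustrated with a toy sketch. The snippet below (an illustrative assumption on our part, not the paper's actual vocoder pipeline) modulates white noise with the speech signal's frame-wise amplitude envelope, preserving one prosodic cue (the intensity contour) while destroying the lexical content:

```python
import numpy as np

def prosody_preserving_noise(speech, sr=16000, frame_ms=20):
    """Toy sketch: replace speech with white noise modulated by the
    utterance's amplitude envelope. Keeps the intensity contour (one
    prosodic cue) while making the content unintelligible.
    Illustrative only; the paper's vocoder-based method also controls
    pitch and is far more sophisticated."""
    frame = int(sr * frame_ms / 1000)     # samples per analysis frame
    n = len(speech) // frame * frame      # trim to a whole number of frames
    speech = np.asarray(speech[:n], dtype=float)
    # frame-wise RMS energy as a crude amplitude envelope
    rms = np.sqrt((speech.reshape(-1, frame) ** 2).mean(axis=1))
    envelope = np.repeat(rms, frame)
    # white noise, peak-normalized, shaped by the envelope
    noise = np.random.randn(n)
    return noise / (np.abs(noise).max() + 1e-9) * envelope
```

A real implementation would additionally preserve the F0 contour (e.g. via a parametric vocoder), but even this minimal version conveys why such stimuli can probe whether a model uses prosody without access to words.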