🤖 AI Summary
This paper addresses the challenge of turn-taking prediction in multiparty dialogue. It extends Voice Activity Projection (VAP)—previously limited to dyadic settings—to triadic conversations, proposing the first triadic multi-speaker VAP model. The model operates solely on raw audio, requiring neither transcripts nor speaker annotations, and is trained and evaluated on a Japanese triadic dialogue dataset. The key contributions are a temporal modeling architecture tailored to three-party interaction dynamics and a systematic demonstration that dialogue type significantly affects prediction performance. Experiments show that the approach consistently outperforms dyadic VAP baselines across multiple backbone architectures, validating the effectiveness and robustness of the VAP framework for scaling to more complex, multi-speaker conversational scenarios.
📝 Abstract
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work applies voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of a VAP model is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP models trained on triadic conversation outperformed the baseline for all architectures, although the type of conversation affected accuracy. This study establishes that VAP can be used for turn-taking prediction in triadic dialogue. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
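To make the prediction target concrete, here is a minimal sketch of how VAP-style models typically discretize future voice activity into a single class label. The bin count (4 per speaker) and the binarized-bin state space follow the original dyadic VAP formulation; their use here, and the function name `vap_state`, are illustrative assumptions, not details taken from this paper.

```python
# Hedged sketch: mapping a projected voice-activity window to a discrete
# VAP state. Assumption (not stated in the abstract above): each speaker's
# future window is split into N_BINS bins, each binarized to active/inactive,
# giving 2 ** (n_speakers * N_BINS) discrete states — 256 for dyadic VAP,
# 4096 for the triadic case considered here.

N_BINS = 4  # assumed, following the dyadic VAP formulation

def vap_state(activity):
    """Map per-speaker binarized future bins to a class index.

    activity: list of per-speaker bin lists, e.g. for three speakers
              [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]].
    Returns an integer in [0, 2 ** (len(activity) * N_BINS) - 1].
    """
    idx = 0
    for speaker_bins in activity:
        assert len(speaker_bins) == N_BINS
        for b in speaker_bins:
            idx = (idx << 1) | b  # pack bins as bits, speaker by speaker
    return idx

# Triadic example: speaker 0 active early, speaker 1 active late, speaker 2 silent.
tri = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
print(vap_state(tri))  # -> 3120, one of 4096 possible triadic states
```

A model trained under this scheme outputs a distribution over these discrete states at every audio frame, from which turn-shift probabilities can be read off; moving from two to three speakers grows the state space from 256 to 4096 classes, which is one reason the triadic extension is non-trivial.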