Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses turn-taking prediction in multi-party dialogue. It extends Voice Activity Projection (VAP), previously applied to dyadic settings, to triadic conversations, and the authors describe it as the first study to do so. The model predicts each speaker's future voice activity from acoustic data alone and is trained and evaluated on a Japanese triadic dialogue dataset in which participants discussed a variety of topics. VAP models trained on triadic conversation outperformed the baseline for every model tested, although accuracy varied with the type of conversation. The results indicate that the VAP framework can be used for turn-taking prediction in triadic dialogue.

📝 Abstract
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

Problem

Research questions and friction points this paper is trying to address.

Extends voice activity projection to triadic multi-party conversations
Predicts future turn-taking using only acoustic data
Evaluates VAP performance in Japanese triadic dialogue scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends VAP to triadic multi-party conversations
Uses acoustic data to predict speaker turn-taking
Trains models on Japanese triadic dialogue dataset
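
To make "predicting future voice activity for each speaker" concrete, here is a minimal sketch of how a VAP-style target could be built for three speakers. It follows the discretization commonly used in dyadic VAP work, where a 2-second future window is split into bins and each speaker is marked active or silent per bin. The bin widths, the 50 Hz frame rate, the 0.5 activity threshold, and the function name `projection_label` are all assumptions for illustration; the summary above does not state how the triadic model encodes its targets.

```python
import numpy as np

FRAME_HZ = 50                        # assumed frame rate of the voice-activity track
BIN_SECONDS = (0.2, 0.4, 0.6, 0.8)   # assumed bin widths, totalling a 2 s future window


def projection_label(voice_activity, t, frame_hz=FRAME_HZ, bin_seconds=BIN_SECONDS):
    """Build a VAP-style target for frame t (hypothetical helper).

    voice_activity: array of shape (n_speakers, n_frames) with 0/1 entries.
    Returns (bits, index): bits has shape (n_speakers, n_bins); index is the
    integer obtained by reading the flattened bits as a binary number.
    """
    n_speakers, n_frames = voice_activity.shape
    bin_frames = [int(round(s * frame_hz)) for s in bin_seconds]
    if t + sum(bin_frames) > n_frames:
        raise ValueError("not enough future frames after t")

    bits = np.zeros((n_speakers, len(bin_frames)), dtype=np.int64)
    start = t
    for b, width in enumerate(bin_frames):
        segment = voice_activity[:, start:start + width]
        # A speaker counts as active in a bin if they speak for at least half of it.
        bits[:, b] = segment.mean(axis=1) >= 0.5
        start += width

    # Most significant bit first: speaker 0 / bin 0 carries the largest weight.
    weights = 2 ** np.arange(bits.size)[::-1]
    index = int((bits.ravel() * weights).sum())
    return bits, index


# Toy example: speaker 0 talks for the first 0.6 s, speaker 1 for the last 0.8 s,
# and speaker 2 stays silent.
va = np.zeros((3, 100), dtype=np.int64)
va[0, :30] = 1
va[1, 60:] = 1
bits, index = projection_label(va, t=0)
n_states = 2 ** bits.size  # 3 speakers x 4 bins
```

Under these assumptions, adding a third speaker grows the discrete target space from 2^8 = 256 states (two speakers) to 2^12 = 4096, which is one reason extending VAP beyond dyadic dialogue is not just a matter of adding an input channel.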