🤖 AI Summary
In human dialogue, listener nodding serves as a critical nonverbal feedback cue; however, existing spoken dialogue systems struggle to generate diverse, naturalistic nodding behaviors in real time. To address this, we propose a continuous, real-time head-nod prediction method tailored for attentive listening scenarios. Our approach builds on the voice activity projection (VAP) model, fusing dual-stream audio features extracted from both speaker and listener, and employs multi-task learning with verbal backchannel prediction to jointly predict nodding timing and fine-grained nod types (e.g., affirming, echoing). We further introduce pretraining on large-scale general dialogue data to enhance generalization. Experiments show that multi-task learning significantly improves timing and type prediction, and that reducing the processing rate enables real-time operation with negligible accuracy degradation. In subjective evaluations, the resulting avatar attentive listening system outperformed a conventional baseline that always nods in sync with verbal backchannels, and the model has been integrated into an open-source virtual listener system.
📝 Abstract
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are likewise expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. Unlike conventional models, we extend it to predict various types of nodding in a continuous, real-time manner. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction tasks, multi-task learning yielded significant improvements. We confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always nods in sync with verbal backchannels. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.
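To make the architecture described above concrete, here is a minimal frame-level sketch of the multi-task setup: a shared encoder fuses speaker and listener audio features, and two heads predict nod timing (binary) and nod type. This is an illustrative toy with random, untrained weights; the feature dimension, hidden size, and type inventory are assumptions, not values from the paper, and the real model uses a VAP-style transformer rather than a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 40   # per-stream acoustic feature size per frame (assumption)
HIDDEN = 64      # shared-representation width (assumption)
NOD_TYPES = 3    # e.g. none / affirming / echoing (assumption)

# Random placeholder weights; a trained model would learn these jointly
# across the nod-timing, nod-type, and backchannel tasks.
W_enc = rng.standard_normal((2 * FRAME_DIM, HIDDEN)) * 0.1
W_timing = rng.standard_normal((HIDDEN, 1)) * 0.1
W_type = rng.standard_normal((HIDDEN, NOD_TYPES)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_frame(speaker_feat, listener_feat):
    """One real-time step: fuse both audio streams, then emit a
    nod-onset probability and a distribution over nod types."""
    fused = np.concatenate([speaker_feat, listener_feat])
    h = np.tanh(fused @ W_enc)              # shared encoder representation
    timing_prob = float(sigmoid(h @ W_timing)[0])  # P(nod starts now)
    type_dist = softmax(h @ W_type)         # distribution over nod types
    return timing_prob, type_dist

# Usage: feed one frame of features from each participant's audio.
speaker = rng.standard_normal(FRAME_DIM)
listener = rng.standard_normal(FRAME_DIM)
p, dist = predict_frame(speaker, listener)
```

Running this loop once per audio frame (and lowering the frame rate, as the abstract notes) is what makes continuous real-time operation feasible: each step is a fixed, small amount of computation regardless of dialogue length.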