🤖 AI Summary
This work addresses the challenge of generating natural, diverse, and contextually appropriate listener body reactions from a speaker's utterance in human–agent interaction. We propose ReactMotion, a novel framework that, for the first time, enables end-to-end modeling of the joint relationships among text, audio, emotion, and motion, augmented with a preference-learning objective that optimizes reaction appropriateness. To support this approach, we introduce ReactMotionNet, a large-scale one-to-many dataset of listener motions, and design a preference-based evaluation protocol focused on reaction quality. Experimental results demonstrate that our method significantly outperforms both retrieval-based baselines and cascaded large language model approaches in terms of motion naturalness, diversity, and contextual responsiveness.
📝 Abstract
In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs each speaker utterance with multiple candidate listener motions annotated with varying degrees of appropriateness. This design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on the dataset, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, an aspect that conventional motion metrics focused on input-motion alignment overlook. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
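As a rough illustration of the kind of preference-based objective described above (the abstract does not specify ReactMotion's exact loss, so this is an assumption, not the paper's method), the sketch below shows a generic Bradley–Terry-style pairwise ranking loss over two candidate listener motions with different annotated appropriateness. All names (`pairwise_preference_loss`, `model.score`, `context`) are hypothetical placeholders.

```python
# Hypothetical sketch of a pairwise preference objective over candidate
# listener motions. Names, shapes, and the scoring model are assumptions
# for illustration only; this is not ReactMotion's actual training code.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the model to score the more
    appropriate listener motion above the less appropriate one."""
    # -log sigmoid(s_pref - s_rej), averaged over the batch
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Usage with a hypothetical scorer that rates how appropriate a candidate
# listener motion is given the speaker-utterance context:
#   s_pref = model.score(context, motion_more_appropriate)   # shape (B,)
#   s_rej  = model.score(context, motion_less_appropriate)   # shape (B,)
#   loss   = pairwise_preference_loss(s_pref, s_rej)
```

Such a pairwise objective only requires the relative appropriateness ordering that a one-to-many dataset like ReactMotionNet provides, rather than a single ground-truth motion per utterance.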