🤖 AI Summary
This work addresses the challenge of generating natural, diverse, and contextually appropriate listener body reactions from a speaker's utterance in human–agent interaction. We propose ReactMotion, a novel framework that, for the first time, enables end-to-end modeling of the joint relationships among text, audio, emotion, and motion, augmented with a preference-learning objective that optimizes reaction appropriateness. To support this approach, we introduce ReactMotionNet, a large-scale one-to-many dataset of listener motions, and design a preference-based evaluation protocol focused on reaction quality. Experimental results demonstrate that our method significantly outperforms both retrieval-based baselines and cascaded large language model approaches in terms of motion naturalness, diversity, and contextual responsiveness.
📝 Abstract
In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs each speaker utterance with multiple candidate listener motions annotated with varying degrees of appropriateness. This design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on the dataset, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, an aspect that conventional motion metrics focused on input-motion alignment overlook. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
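As a rough illustration of the kind of preference-based objective described above (the abstract does not specify ReactMotion's exact loss, so this is an assumption, not the paper's method), the sketch below shows a generic Bradley–Terry-style pairwise ranking loss over two candidate listener motions with different annotated appropriateness. All names (`pairwise_preference_loss`, `model.score`, `context`) are hypothetical placeholders.

```python
# Hypothetical sketch of a pairwise preference objective over candidate
# listener motions. Names, shapes, and the scoring model are assumptions
# for illustration only; this is not ReactMotion's actual training code.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the model to score the more
    appropriate listener motion above the less appropriate one."""
    # -log sigmoid(s_pref - s_rej), averaged over the batch
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Usage with a hypothetical scorer that rates how appropriate a candidate
# listener motion is given the speaker-utterance context:
#   s_pref = model.score(context, motion_more_appropriate)   # shape (B,)
#   s_rej  = model.score(context, motion_less_appropriate)   # shape (B,)
#   loss   = pairwise_preference_loss(s_pref, s_rej)
```

Such a pairwise objective only requires the relative appropriateness ordering that a one-to-many dataset like ReactMotionNet provides, rather than a single ground-truth motion per utterance.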