VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction

📅 2025-04-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing listener head motion modeling in virtual dialogues lacks fine-grained emotional expressiveness over long sequences, precise motion controllability, and large-scale, fine-grained multimodal paired data. Method: We collect ListenerX, the first speaker-listener paired dataset, containing more than 1.4M frames jointly annotated with text-based facial expression descriptions, emotion intensity scores, and 3D head poses. On top of it, we propose the VividListener framework, which integrates multimodal conditional guidance with adaptive interactive embedding modeling: a Responsive Interaction Module (RIM) provides fine-grained semantic-motion alignment, and Emotional Intensity Tags (EIT) jointly regulate textual semantics and motion amplitude. Contribution/Results: ListenerX and VividListener together enable the first cross-modal, editable, expressive, and controllable listener motion generation. Evaluated on ListenerX, our method achieves state-of-the-art performance, significantly improving emotional authenticity, motion diversity, and text-motion consistency.
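To make the dataset structure concrete, below is a minimal sketch of what one ListenerX speaker-listener sample might look like. All field names, shapes, and value ranges are assumptions for illustration; the paper only states that frames are paired across speaker and listener and annotated with text-based expression descriptions, emotion intensity scores, and 3D head poses.

```python
# Hypothetical sample layout for a ListenerX clip (all names/shapes assumed).
from dataclasses import dataclass
import numpy as np


@dataclass
class ListenerXClip:
    speaker_audio: np.ndarray    # (num_samples,) waveform; 16 kHz assumed
    speaker_pose: np.ndarray     # (T, 6) per-frame 3D head rotation + translation
    listener_pose: np.ndarray    # (T, 6) paired listener head poses
    expression_text: str         # text-based facial expression description
    emotion_intensity: float     # scalar intensity score; [0, 1] assumed


clip = ListenerXClip(
    speaker_audio=np.zeros(16000),
    speaker_pose=np.zeros((30, 6)),
    listener_pose=np.zeros((30, 6)),
    expression_text="nods gently with a warm smile",
    emotion_intensity=0.7,
)
print(clip.listener_pose.shape)  # (30, 6)
```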

๐Ÿ“ Abstract
Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for practical dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior, overlooking fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term, large-scale paired speaker-listener corpora that include head dynamics and fine-grained multi-modality annotations (e.g., text-based expression descriptions, emotional intensity) further limits the application of dialogue modeling. Therefore, we first collect a large-scale multi-turn dataset of 3D dyadic conversation containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures that the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reactions to speaker behavior. Meanwhile, we design the Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applied to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
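The abstract states that EIT edits emotion intensity by acting on both the text descriptions and the listener motion amplitude. As a rough illustration of that dual role, here is a toy PyTorch module in which a discrete intensity tag FiLM-modulates a text-condition embedding and scales the generated motion. The module name, the FiLM-style modulation, the tag vocabulary size, and all dimensions are assumptions, not the paper's implementation.

```python
# Toy sketch of EIT-style dual conditioning (architecture assumed, not the paper's).
import torch
import torch.nn as nn


class IntensityTagConditioner(nn.Module):
    def __init__(self, num_tags: int = 5, dim: int = 256):
        super().__init__()
        self.tag_embed = nn.Embedding(num_tags, dim)   # discrete intensity levels
        self.to_scale_shift = nn.Linear(dim, 2 * dim)  # FiLM params for text features
        self.to_amplitude = nn.Sequential(             # positive gain on motion output
            nn.Linear(dim, 1), nn.Softplus()
        )

    def forward(self, text_feat, motion, tag):
        e = self.tag_embed(tag)                                # (B, dim)
        scale, shift = self.to_scale_shift(e).chunk(2, dim=-1)
        text_feat = text_feat * (1 + scale) + shift            # intensity-aware text condition
        motion = motion * self.to_amplitude(e).unsqueeze(1)    # scale motion amplitude
        return text_feat, motion


cond = IntensityTagConditioner()
text_feat = torch.randn(2, 256)
motion = torch.randn(2, 30, 6)       # (batch, frames, pose dims)
tag = torch.tensor([1, 4])           # low vs. high intensity
text_feat, motion = cond(text_feat, motion, tag)
print(text_feat.shape, motion.shape)  # torch.Size([2, 256]) torch.Size([2, 30, 6])
```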
Problem

Research questions and friction points this paper is trying to address.

Generating nuanced emotional listener head dynamics for dialogue modeling
Lacking fine-grained control over motion and emotional intensity variations
Absence of large-scale paired speaker-listener datasets with multi-modal annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collects large-scale multi-turn 3D dyadic conversation dataset
Proposes VividListener for fine-grained controllable dynamics
Uses multi-modal conditions for coherent speaker-listener interactions (a toy sketch follows below)
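To illustrate how multi-modal conditions could drive coherent speaker-listener interaction, here is a minimal cross-attention sketch in the spirit of the Responsive Interaction Module: listener motion tokens attend to a conditioning sequence built from speaker behavior and text features. The class name, layer choices, and dimensions are assumptions; the page does not detail RIM's actual architecture.

```python
# Minimal cross-attention sketch of an RIM-like interactive embedding (assumed design).
import torch
import torch.nn as nn


class ResponsiveInteractionSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, listener_tokens, speaker_feats, text_feats):
        # Concatenate speaker behavior and textual-description features into one
        # conditioning sequence, then attend from the listener motion tokens.
        context = torch.cat([speaker_feats, text_feats], dim=1)
        out, _ = self.attn(listener_tokens, context, context)
        return self.norm(listener_tokens + out)  # residual interactive embedding


rim = ResponsiveInteractionSketch()
listener = torch.randn(2, 30, 256)  # listener motion tokens
speaker = torch.randn(2, 30, 256)   # speaker audio/motion features
text = torch.randn(2, 8, 256)       # encoded expression description
print(rim(listener, speaker, text).shape)  # torch.Size([2, 30, 256])
```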
Shiying Li
Beijing University of Posts and Telecommunications, Beijing, China
Xingqun Qi
The Hong Kong University of Science and Technology (HKUST)
Computer Vision · Human Motion Modeling · Medical Image Analysis
Bingkun Yang
Beijing University of Posts and Telecommunications, Beijing, China
Weile Chen
Beijing University of Posts and Telecommunications, Beijing, China
Zezhao Tian
Beijing University of Posts and Telecommunications, Beijing, China
Muyi Sun
School of AI, BUPT (previously NLPR, CASIA; earlier BUPT)
Multi-Modality Learning · Computer Vision · Biometrics · Medical Image Analysis
Qifeng Liu
Hong Kong University of Science and Technology, Hong Kong, China
Man Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Zhenan Sun
Institute of Automation, Chinese Academy of Sciences
Biometrics · Pattern Recognition · Computer Vision