InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

📅 2025-12-14
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
Existing approaches struggle to jointly generate realistic full-body motion that is driven by speech and conditioned on object interaction, primarily due to the absence of a unified modeling framework and of high-quality multimodal datasets. To address this, we propose the first end-to-end generative framework that jointly models speech, objects, and motion. Our method employs a multi-stage diffusion model that jointly encodes linguistic input, object prompts, and motion in a shared embedding space. We introduce a generalized motion adaptation module and an adaptive conditional fusion mechanism to enable dynamic cross-modal alignment. Furthermore, we release the first large-scale motion dataset explicitly enhanced with object-interaction annotations. Experiments demonstrate state-of-the-art performance on both speech-driven gesture generation and object-conditioned motion synthesis. Generated motions exhibit high realism, strong object awareness, fine-grained editability, and cross-condition consistency, significantly improving controllability and flexibility.
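To make the conditioning pipeline described above concrete, here is a minimal sketch (not the authors' code) of how per-frame speech features and an object-prompt embedding could be projected into a shared conditioning space for a motion diffusion model. The class name, feature dimensions, and the fusion-by-concatenation choice are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionEmbedder(nn.Module):
    """Illustrative sketch: project speech features and an object-prompt
    embedding into one shared conditioning space for a motion diffusion
    denoiser. Module names and sizes are hypothetical."""

    def __init__(self, speech_dim=1024, prompt_dim=512, embed_dim=256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, embed_dim)  # per-frame audio features
        self.prompt_proj = nn.Linear(prompt_dim, embed_dim)  # e.g. a CLIP-style prompt vector

    def forward(self, speech_feats, prompt_feats):
        # speech_feats: (batch, frames, speech_dim); prompt_feats: (batch, prompt_dim)
        speech_emb = self.speech_proj(speech_feats)               # (B, T, D)
        prompt_emb = self.prompt_proj(prompt_feats).unsqueeze(1)  # (B, 1, D)
        # Concatenate along the token axis so the denoiser can attend to both.
        return torch.cat([prompt_emb, speech_emb], dim=1)         # (B, 1+T, D)
```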

📝 Abstract
Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures and object interactions in isolation, partly because integrated, comprehensive datasets are lacking, which limits real-world applicability. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset by augmenting an existing text-to-motion dataset with detailed object-interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that lets each motion condition be trained independently and then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy that dynamically reweights these signals during diffusion sampling. InteracTalker unifies these previously separate tasks, outperforming prior gesture-focused diffusion methods on both co-speech gesture generation and object-interaction synthesis and yielding highly realistic, object-aware full-body motions with improved flexibility and control.
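The adaptive fusion strategy described in the abstract reweights heterogeneous conditioning signals at each diffusion sampling step. Below is a minimal sketch, assuming a classifier-free-guidance-style combination of per-condition noise predictions with weights derived from each condition's guidance strength; the function name, the `model(x_t, t, speech=..., obj=...)` interface, and the weighting rule are hypothetical, not the paper's implementation.

```python
import torch

def fused_denoise_step(model, x_t, t, speech_cond, object_cond,
                       w_speech=2.0, w_object=2.0):
    """Illustrative multi-condition guidance step (hypothetical API).

    `model(x_t, t, speech=..., obj=...)` is assumed to return a noise
    prediction; passing None drops that condition."""
    eps_uncond = model(x_t, t, speech=None, obj=None)
    eps_speech = model(x_t, t, speech=speech_cond, obj=None)
    eps_object = model(x_t, t, speech=None, obj=object_cond)

    # Guidance directions contributed by each condition.
    d_speech = eps_speech - eps_uncond
    d_object = eps_object - eps_uncond

    # Adaptive reweighting sketch: scale each term by its relative magnitude
    # so that neither modality dominates (one possible reading of
    # "dynamically reweighting the conditioning signals").
    norm_s = d_speech.flatten(1).norm(dim=1, keepdim=True).clamp(min=1e-6)
    norm_o = d_object.flatten(1).norm(dim=1, keepdim=True).clamp(min=1e-6)
    total = norm_s + norm_o
    a_s = (norm_s / total).view(-1, *([1] * (x_t.dim() - 1)))
    a_o = (norm_o / total).view(-1, *([1] * (x_t.dim() - 1)))

    return eps_uncond + w_speech * a_s * d_speech + w_object * a_o * d_object
```

Normalizing the two guidance terms against each other is just one plausible way to keep a strong object prompt from drowning out subtle speech-timed gestures; the paper's actual reweighting rule may differ.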
Problem

Research questions and friction points this paper is trying to address.

Generating realistic human motions that respond to both speech and physical objects
Integrating object-aware interactions with co-speech gesture generation in a single framework
Lack of unified, comprehensive datasets for combined speech-object motion tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage training for unified motion-speech-prompt embedding
Generalized Motion Adaptation Module for independent per-condition training and dynamic combination at inference
Adaptive fusion strategy to reweight heterogeneous conditioning signals
Sreehari Rajan
Machine Learning Lab, IIIT Hyderabad, India
Kunal Bhosikar
Machine Learning Lab, IIIT Hyderabad, India
Charu Sharma
International Institute of Information Technology Hyderabad
Geometric Machine Learning · Point Clouds · Graph Representation Learning · Optimal Transport