Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of unnatural human motion, distorted object dynamics, and weak interaction plausibility in text-driven 3D human-object interaction generation, which stem from the cross-modal gap between language and 3D motion. To overcome these limitations, we propose a generative framework that integrates priors from multimodal large language models. Our approach introduces a novel object representation combining geometric keypoints and contact-aware features, a modality-aware mixture-of-experts (MoE) fusion strategy, and a cascaded diffusion mechanism with explicit interaction supervision. Experimental results demonstrate that our method significantly outperforms existing approaches in generating high-fidelity, fine-grained 3D human-object interactions, effectively enhancing the naturalness of both human and object motion as well as the physical plausibility of their interactions.

📝 Abstract
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We augment existing object representations with geometric keypoints, contact features, and dynamic properties, yielding a more expressive object representation, which tackles Q2 in data representation. (3) Modality-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
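The abstract's third insight, modality-aware MoE fusion, can be illustrated with a toy sketch. The paper does not publish implementation details here, so everything below (dimensions, number of experts, linear experts, the one-hot modality tag fed to the gate) is an assumption chosen only to show the routing idea: the gate sees which modality a feature came from, so text, image, and pose/object features can be sent to different experts before being fused.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ModalityAwareMoE:
    """Toy modality-aware mixture-of-experts fusion (illustrative sketch,
    not the paper's architecture). Experts are random linear maps; the
    gate conditions on a one-hot modality tag appended to the feature,
    so routing weights can differ per modality."""

    def __init__(self, dim=16, n_experts=4, n_modalities=3):
        self.experts = [rng.normal(size=(dim, dim)) / np.sqrt(dim)
                        for _ in range(n_experts)]
        # Gate input = feature concatenated with the modality tag.
        self.gate_w = rng.normal(size=(dim + n_modalities, n_experts)) * 0.1
        self.n_modalities = n_modalities

    def forward(self, x, modality_id):
        tag = np.eye(self.n_modalities)[modality_id]
        weights = softmax(self.gate_w.T @ np.concatenate([x, tag]))
        outs = np.stack([W @ x for W in self.experts])   # (n_experts, dim)
        return weights @ outs                            # gated fusion

moe = ModalityAwareMoE()
text_feat = rng.normal(size=16)
fused = moe.forward(text_feat, modality_id=0)  # modality 0 = text (assumed)
print(fused.shape)
```

In a trained model the gate and experts would be learned jointly; the point of the sketch is only that the modality tag lets one shared MoE specialize its routing per input modality.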
Problem

Research questions and friction points this paper is trying to address.

text-driven 3D generation
human-object interaction
cross-modality gap
motion generation
multimodal priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Priors
Human-Object Interaction
Mixture-of-Experts
Cascaded Diffusion
3D Motion Generation
👥 Authors

Yin Wang
Beihang University
Human Motion Generation, Multimodal Learning

Ziyao Zhang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China

Zhiying Leng
Beihang University | Technische Universität München
Hand Pose Estimation, Graph Neural Network, Semantic Segmentation

Haitian Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China

Frederick W. B. Li
Department of Computer Science, University of Durham, U.K.

Mu Li
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China

Xiaohui Liang
University of Massachusetts Boston
Mobile Healthcare, Voice Technology, Internet of Things, Privacy