Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

2025-03-25
Citations: 0
Influential: 0
AI Summary
Existing 3D human-object interaction (HOI) synthesis methods are constrained by the scarcity and high acquisition cost of 3D HOI data, which limits their coverage of object categories and interaction types. This paper introduces a zero-shot 3D HOI synthesis framework that generates physically plausible HOI sequences from text descriptions without end-to-end training on 3D HOI datasets. The approach has three parts: (1) a zero-shot generation paradigm that draws on priors from pre-trained multimodal models; (2) a generalizable category-level 6-DoF object pose estimation method that adapts to object templates obtained from text-to-3D models or online retrieval; and (3) physics-based tracking that refines both body motion and object poses. Experiments show that the method generates open-vocabulary HOIs with physical realism and semantic diversity.

πŸ“ Abstract
Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, which limits existing methods to the narrow range of object types and interaction patterns found in training datasets. This paper proposes a novel zero-shot HOI synthesis framework that does not rely on end-to-end training on the currently limited 3D HOI datasets. The core idea of our method is to leverage the extensive HOI knowledge in pre-trained multimodal models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models; these sequences are then lifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method adapts to various object templates obtained from text-to-3D models or online retrieval. Physics-based tracking of the 3D HOI kinematic milestones is then applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.

Problem

Research questions and friction points this paper is trying to address.

Synthesizing 3D human-object interactions without training on 3D HOI datasets
Leveraging pre-trained models for diverse interaction generation
Ensuring physical realism in zero-shot HOI synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages HOI knowledge from pre-trained multimodal models
Lifts generated 2D HOI image sequences to 3D milestones via human and object pose estimation
Applies physics-based tracking to refine body motions and object poses
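
The stages described in the abstract can be read as a simple pipeline: text to 2D frames, frames to 3D milestones, milestones to a physics-refined sequence. The following is only a minimal structural sketch of that data flow, not the authors' implementation. Every name in it (`Milestone`, `synthesize_hoi`, and the four stage callables) is hypothetical, and each stage is passed in as a plain function because the summary does not specify which models or physics engine are used.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Milestone:
    """One kinematic 3D HOI keyframe (hypothetical container)."""
    human_pose: Sequence[float]   # body pose parameters from a human pose estimator
    object_pose: Sequence[float]  # 6-DoF object pose: 3 translation + 3 rotation values


def synthesize_hoi(
    text: str,
    generate_frames: Callable[[str], List[object]],
    estimate_human: Callable[[object], Sequence[float]],
    estimate_object: Callable[[object], Sequence[float]],
    refine: Callable[[List[Milestone]], List[Milestone]],
) -> List[Milestone]:
    """Chain the stages in the order the abstract describes them.

    All four callables are placeholders (assumptions) for an image/video
    generation model, a human pose estimator, a 6-DoF object pose
    estimator, and a physics-based tracker, respectively.
    """
    # Stage 1: text -> temporally consistent 2D HOI frames.
    frames = generate_frames(text)

    # Stage 2: lift each 2D frame to a 3D milestone of human and object pose.
    milestones: List[Milestone] = []
    for frame in frames:
        object_pose = estimate_object(frame)
        if len(object_pose) != 6:
            raise ValueError("expected a 6-DoF object pose (6 values)")
        milestones.append(Milestone(estimate_human(frame), object_pose))

    # Stage 3: refine the kinematic milestones with physics-based tracking.
    return refine(milestones)
```

Passing the stages as arguments keeps the sketch independent of any particular model: each placeholder can be swapped for a real component without changing the control flow.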
Yuke Lou
Peking University
Character Animation, Computer Graphics, Computer Vision

Yiming Wang
ETH Zurich, Switzerland

Zhen Wu
Stanford University, United States of America

Rui Zhao
Tencent, China

Wenjia Wang
The University of Hong Kong, China

Mingyi Shi
The University of Hong Kong
character animation, computer graphics, deep learning

Taku Komura
The University of Hong Kong
Character Animation, Computer Graphics, Robotics