ViHOI: Human-Object Interaction Synthesis with Visual Priors

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating realistic and physically plausible 3D human-object interaction (HOI) motions remains challenging, primarily because natural language often fails to fully capture complex physical constraints. This work proposes ViHOI, a novel framework that leverages visual and textual priors from 2D images to guide 3D HOI synthesis—marking the first approach to do so. Specifically, ViHOI extracts multimodal priors using large vision-language models, compresses these features via a Q-Former adapter, and injects them into a diffusion-based generative model through a layer-decoupled strategy. Extensive experiments demonstrate that ViHOI significantly outperforms existing methods across multiple benchmarks and exhibits strong generalization capabilities on unseen objects and interaction categories.

📝 Abstract
Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.
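The abstract's Q-Former-based adapter compresses a long sequence of high-dimensional VLM features into a small, fixed number of prior tokens. The paper's implementation is not reproduced here; the following is a minimal NumPy sketch of the general idea (single-head cross-attention from learned query tokens to the feature sequence), with all dimensions and variable names chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_compress(features, queries, w_q, w_k, w_v):
    """Cross-attention compression: a fixed set of learned query tokens
    attends over a variable-length sequence of VLM features.
    features: (n, d_in)  high-dimensional VLM features
    queries:  (k, d_att) learned query tokens, k << n
    returns:  (k, d_v)   compact prior tokens"""
    q = queries @ w_q                                 # (k, d_att)
    keys = features @ w_k                             # (n, d_att)
    vals = features @ w_v                             # (n, d_v)
    attn = softmax(q @ keys.T / np.sqrt(q.shape[-1]))  # (k, n), rows sum to 1
    return attn @ vals                                # (k, d_v)

# Illustrative sizes: 257 ViT-style patch features of width 768
# are compressed into 16 prior tokens of width 64.
d_in, d_att, d_v, n, k = 768, 64, 64, 257, 16
feats   = rng.standard_normal((n, d_in))
queries = rng.standard_normal((k, d_att))
w_q = rng.standard_normal((d_att, d_att)) * 0.02
w_k = rng.standard_normal((d_in, d_att)) * 0.02
w_v = rng.standard_normal((d_in, d_v)) * 0.02

tokens = qformer_compress(feats, queries, w_q, w_k, w_v)
print(tokens.shape)  # fixed-size output regardless of n
```

The key property this illustrates is that the output size depends only on the number of learned queries, not on the input length, which is what makes the compressed tokens convenient conditioning inputs for the diffusion model.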
Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction
3D Motion Generation
Physical Plausibility
Visual Priors
Diffusion Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-Object Interaction
Diffusion Model
Visual Prior
Vision-Language Model
Motion Generation
Songjin Cai
South China University of Technology
Linjie Zhong
South China University of Technology
Ling Guo
South China University of Technology
Changxing Ding
Professor @ South China University of Technology
Computer Vision · Embodied AI