SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing hand-object interaction (HOI) generation methods suffer from two key limitations: (1) 3D motion generation relies on predefined object models and constrained lab-captured data, resulting in poor generalizability; (2) video generation prioritizes pixel-level photorealism while neglecting physical plausibility. To address these, we propose the first unified diffusion-based framework for joint video and 3D motion generation. Our approach integrates tri-modal adaptive modulation and a 3D full-attention mechanism, enabling a vision-perception-driven, closed-loop interactive generation pipeline. Crucially, it requires neither pre-specified object models nor annotated action data, achieving simultaneous synthesis of physically plausible and visually realistic HOI sequences across diverse scenes. Extensive evaluation demonstrates substantial improvements in video-motion consistency and dynamic plausibility on unseen real-world scenarios, outperforming state-of-the-art methods across multiple quantitative metrics.

📝 Abstract
Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, limiting their generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
Problem

Research questions and friction points this paper is trying to address.

Overcoming reliance on predefined 3D object models and lab-captured motion data in HOI generation
Balancing visual fidelity and physical plausibility in HOI video generation
Integrating heterogeneous semantics, appearance, and motion features for synchronized video-motion output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synchronized diffusion for video and motion generation
Tri-modal adaptive modulation for feature alignment
Vision-aware 3D interaction diffusion model
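The paper itself gives no implementation detail here, but "adaptive modulation for feature alignment" conditioned on a shared semantic signal is commonly realized as AdaLN-style scale-and-shift conditioning. A minimal NumPy sketch of that general idea, where all function names, shapes, and the reuse of one projection across modalities are illustrative assumptions rather than the authors' actual design:

```python
import numpy as np

def adaptive_modulation(x, cond, w, b):
    """AdaLN-style modulation: normalize features, then scale/shift them
    with parameters predicted from a shared conditioning vector.

    x    : (n_tokens, d)  one modality's token features (e.g. video or motion)
    cond : (d_c,)         shared semantic embedding
    w    : (d_c, 2 * d)   projection producing per-channel scale and shift
    b    : (2 * d,)       projection bias
    """
    scale_shift = cond @ w + b                 # (2d,): condition-dependent params
    d = x.shape[1]
    scale, shift = scale_shift[:d], scale_shift[d:]
    mu = x.mean(axis=1, keepdims=True)         # per-token layer-norm statistics
    sigma = x.std(axis=1, keepdims=True) + 1e-6
    return (x - mu) / sigma * (1.0 + scale) + shift

rng = np.random.default_rng(0)
d, d_c = 8, 4
video_tokens = rng.normal(size=(16, d))        # hypothetical video patch tokens
motion_tokens = rng.normal(size=(10, d))       # hypothetical motion tokens
cond = rng.normal(size=(d_c,))                 # shared semantic condition
w, b = rng.normal(size=(d_c, 2 * d)), np.zeros(2 * d)

# In a tri-modal setup each modality would get its own modulation parameters;
# one projection is reused here only to keep the sketch short.
v_mod = adaptive_modulation(video_tokens, cond, w, b)
m_mod = adaptive_modulation(motion_tokens, cond, w, b)
```

After modulation, both token streams live in a comparably scaled feature space, which is what lets a subsequent joint (3D full-) attention model inter- and intra-modal dependencies without one modality's raw scale dominating.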