SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

πŸ“… 2025-11-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing hand-object interaction (HOI) generation methods rely on single-view inputs, leading to incomplete 3D geometric perception, motion distortion, and poor generalization to real-world scenarios; meanwhile, 3D HOI approaches that require high-fidelity 3D annotations have limited practicality. This paper proposes the first framework to jointly generate multi-view HOI videos and 4D dynamic motion sequences. It introduces a Multi-view Joint Diffusion (MJD) model and a Diffusion Points Aligner (DPA) that synergistically integrate 2D visual priors with 4D motion dynamics. A novel closed-loop video–point-trajectory co-optimization mechanism enforces consistency across appearance, motion, and geometry. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in visual realism, motion plausibility, and multi-view consistency, while generalizing effectively to real scenes without requiring any 3D annotations.

πŸ“ Abstract
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
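The closed-loop cycle described in the abstract can be sketched as a loop of three stages per denoising step. The following is a minimal, runnable illustration only; `mjd_denoise_step`, `dpa_align`, `reproject`, and all shapes are hypothetical stand-ins, not the authors' implementation:

```python
# Hypothetical sketch of the closed-loop co-generation cycle: MJD jointly
# denoises multi-view videos and a coarse motion, DPA refines the motion into
# aligned 4D point tracks, and the tracks are reprojected to guide the next
# step. All functions here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

V, T, H, W, C = 2, 4, 8, 8, 3   # views, frames, height, width, channels
N = 16                           # number of tracked points
STEPS = 3                        # denoising steps (real models use many more)

def mjd_denoise_step(videos, coarse_motion, guidance):
    """Stand-in for one Multi-view Joint Diffusion step: jointly updates the
    multi-view videos and the intermediate (coarse) motion, conditioned on
    the reprojected 4D tracks from the previous step."""
    videos = 0.9 * videos + 0.1 * guidance.mean() * np.ones_like(videos)
    coarse_motion = 0.9 * coarse_motion + 0.1 * videos.mean()
    return videos, coarse_motion

def dpa_align(coarse_motion, videos):
    """Stand-in for the Diffusion Points Aligner: refines coarse motion into
    globally aligned 4D metric point tracks (T x N x 3), conditioned on the
    current video estimate."""
    return coarse_motion + 0.01 * videos.mean()

def reproject(tracks_4d):
    """Stand-in for reprojecting the 4D point tracks into each of the V
    views as a 2D guidance signal for the next joint-generation step."""
    return np.full((V,), tracks_4d.mean())

videos = rng.normal(size=(V, T, H, W, C))   # noisy multi-view videos
coarse = rng.normal(size=(T, N, 3))         # noisy intermediate motion
guidance = np.zeros((V,))

for step in range(STEPS):
    # (1) joint denoising of appearance and coarse motion (MJD)
    videos, coarse = mjd_denoise_step(videos, coarse, guidance)
    # (2) video-conditioned refinement into aligned 4D tracks (DPA)
    tracks_4d = dpa_align(coarse, videos)
    # (3) reproject the tracks to guide the next joint generation step
    guidance = reproject(tracks_4d)

print(videos.shape, tracks_4d.shape)
```

The point of the cycle is that neither branch is trusted alone: the video estimate conditions the motion refinement, and the refined geometry is fed back as guidance, so appearance and 4D dynamics converge together across denoising steps.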
Problem

Research questions and friction points this paper is trying to address.

Single-view HOI generation causes geometric distortions and unrealistic motions
3D HOI methods lack generalization due to lab-captured data dependency
Need unified approach for synchronized multi-view video and 4D motion generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view Joint Diffusion (MJD) model co-generates synchronized multi-view HOI videos and intermediate motions
Diffusion Points Aligner (DPA) refines coarse motion into globally aligned 4D metric point tracks
Closed-loop cycle couples 2D appearance with 4D dynamics: generated video conditions motion refinement, and reprojected tracks guide the next denoising step
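The reprojection step, mapping metric 3D point tracks into each camera view as 2D guidance, can be sketched with a standard pinhole camera model. All values below are illustrative, not taken from the paper:

```python
# Standard pinhole reprojection: x = K [R | t] X, followed by the
# perspective divide. Camera parameters here are made up for illustration.
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points into Nx2 pixel coordinates."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    pix = cam @ K.T                    # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]    # perspective divide

K = np.array([[500.,   0., 320.],      # focal lengths and principal point
              [  0., 500., 240.],
              [  0.,   0.,   1.]])
R = np.eye(3)                          # camera aligned with world axes
t = np.array([0., 0., 2.])             # camera 2 m in front of the origin

pts = np.array([[0.0, 0.0, 0.0],       # world origin
                [0.1, 0.0, 0.0]])      # 10 cm to the right
uv = project_points(pts, K, R, t)
print(uv)  # origin lands at the principal point (320, 240)
```

With one (K, R, t) per view, the same 4D tracks yield a per-view 2D signal, which is what makes a multi-view consistency constraint between generated appearance and generated geometry possible.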
πŸ”Ž Similar Papers
No similar papers found.