Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

πŸ“… 2025-09-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenges of aligning generated moving parts with articulated joint structures and maintaining cross-view and temporal consistency from monocular RGB video. The authors propose Stable Part Diffusion 4D (SP4D), a dual-branch diffusion framework. Methodologically: (i) a spatial color encoding embeds part-level semantics into the image generation process; (ii) a Bidirectional Diffusion Fusion (BiDiFuse) module, coupled with a contrastive part-consistency loss, models cross-branch collaboration between the RGB and part-segmentation modalities; (iii) a shared latent VAE and straightforward post-processing enable the generated 2D part maps to be lifted to 3D skeletal structures and skinning weights. Trained and evaluated on KinematicParts20K, the method generalizes to real-world videos, novel objects, and rare poses, and its outputs combine high visual fidelity with kinematically plausible motion, directly supporting downstream tasks such as animation rigging and motion analysis.

πŸ“ Abstract
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
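The spatial color encoding described above can be illustrated with a minimal sketch: each discrete part ID is mapped to a color so the mask becomes a continuous RGB-like image the shared VAE can encode, and part labels are recovered afterward by nearest-color assignment. Function names, the random palette, and the decoder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def encode_parts_to_rgb(part_mask, palette=None):
    """Map an integer part mask (H, W) to a continuous RGB image (H, W, 3).

    palette: optional (num_parts, 3) array of colors in [0, 1];
    a random palette is drawn if none is given (assumption for illustration).
    """
    n_parts = int(part_mask.max()) + 1
    if palette is None:
        rng = np.random.default_rng(0)
        palette = rng.uniform(0.0, 1.0, size=(n_parts, 3))  # one color per part
    return palette[part_mask], palette

def decode_rgb_to_parts(rgb, palette):
    """Recover part IDs from a (possibly noisy) RGB encoding by
    nearest-color assignment -- the 'straightforward post-processing' step."""
    # (H, W, P): distance from every pixel to every palette color
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None, :, :], axis=-1)
    return dists.argmin(axis=-1)
```

Because the encoding is just a palette lookup, the part count is not baked into the architecture: a different palette size changes nothing else, which matches the abstract's point about flexibly enabling different part counts.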
Problem

Research questions and friction points this paper is trying to address.

Generating paired RGB and kinematic part videos from monocular inputs
Learning structural components aligned with object articulation across views
Enabling 2D part maps to be lifted to 3D skeletal structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch diffusion model synthesizes RGB and part videos
Spatial color encoding enables flexible part segmentation sharing
Bidirectional Diffusion Fusion module ensures cross-branch consistency
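The contrastive part consistency loss named above can be sketched as an InfoNCE-style objective: per-part mean features are pooled from two views (or frames), and the same part ID across views forms the positive pair while other parts act as negatives. The function name, mean pooling, and temperature are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def part_consistency_loss(feat_a, feat_b, parts_a, parts_b, tau=0.1):
    """InfoNCE-style sketch of a contrastive part-consistency loss.

    feat_a, feat_b: (H, W, C) feature maps from two views/frames.
    parts_a, parts_b: (H, W) integer part masks for those views.
    """
    # parts visible in both views; each contributes one positive pair
    ids = np.intersect1d(np.unique(parts_a), np.unique(parts_b))
    za = np.stack([feat_a[parts_a == i].mean(axis=0) for i in ids])  # (P, C)
    zb = np.stack([feat_b[parts_b == i].mean(axis=0) for i in ids])  # (P, C)
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau  # (P, P): row i vs. every part in view b
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # diagonal entries are the positives
```

Minimizing this pulls a part's features together across views and pushes different parts apart, which is one way to promote the spatial and temporal alignment of part predictions the abstract describes.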