🤖 AI Summary
This work addresses the challenge of generating high-fidelity dynamic 3D content (i.e., video-to-4D synthesis) from a single input video. To overcome the difficulties of jointly modeling spatiotemporal geometry and appearance under data scarcity, we propose the first end-to-end diffusion framework for direct 4D generation that requires no per-instance fitting. Our method introduces a Gaussian Variation Field VAE that jointly encodes 3D geometry, appearance, and motion into a compact latent space, and a temporal-aware Diffusion Transformer that enables efficient 4D synthesis over canonical Gaussian splats. Trained exclusively on synthetic data, the model exhibits strong generalization to real-world videos. It significantly outperforms existing approaches in both fidelity and controllability, enabling high-quality, animatable 4D content generation without per-instance fine-tuning.
📝 Abstract
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, compressing high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
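To make the first stage of the pipeline concrete, the sketch below illustrates the core idea behind the Variation Field VAE: per-frame Gaussian-splat parameters are expressed as deviations from a shared canonical set and compressed into a compact per-frame latent, which is what the diffusion model would later operate on. All shapes, dimensions, and the linear encoder/decoder here are invented for illustration; the actual VAE is a learned network and the splat dimensionality is not specified in the abstract.

```python
import numpy as np

# Hypothetical shapes, assumed for illustration only (not from the paper):
N_SPLATS, SPLAT_DIM = 1024, 14   # e.g. pos(3)+scale(3)+rot(4)+opacity(1)+color(3)
T, LATENT_DIM = 16, 256          # video frames and compact per-frame latent size

rng = np.random.default_rng(0)

def encode_variation_field(canonical_gs, per_frame_gs, W_enc):
    """Toy stand-in for the Variation Field VAE encoder: compress each
    frame's deviation from the canonical splats into one compact latent."""
    deltas = per_frame_gs - canonical_gs[None]   # (T, N, D) temporal variations
    flat = deltas.reshape(deltas.shape[0], -1)   # flatten spatial dims per frame
    return flat @ W_enc                          # (T, LATENT_DIM)

def decode_variation_field(canonical_gs, latents, W_dec):
    """Toy decoder: map latents back to per-frame splat deviations and
    apply them on top of the canonical Gaussians to animate them."""
    deltas = (latents @ W_dec).reshape(latents.shape[0], N_SPLATS, SPLAT_DIM)
    return canonical_gs[None] + deltas           # animated splats, (T, N, D)

canonical = rng.normal(size=(N_SPLATS, SPLAT_DIM))
frames = canonical[None] + 0.01 * rng.normal(size=(T, N_SPLATS, SPLAT_DIM))
W_enc = rng.normal(size=(N_SPLATS * SPLAT_DIM, LATENT_DIM)) / np.sqrt(N_SPLATS * SPLAT_DIM)
W_dec = W_enc.T  # untrained linear maps; a real VAE learns nonlinear ones

z = encode_variation_field(canonical, frames, W_enc)   # compact latent sequence
recon = decode_variation_field(canonical, z, W_dec)
print(z.shape, recon.shape)
```

Under these toy dimensions, each frame's 1024x14 splat deviation is compressed roughly 56x into a 256-d latent; the second-stage diffusion transformer, conditioned on the input video and canonical GS, would generate such latent sequences rather than raw splat trajectories.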