Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation

πŸ“… 2025-08-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address spatiotemporal inconsistency, detail degradation, and poor alignment with user intent in high-fidelity 4D content generation from monocular video, this paper proposes a diffusion-integrated 4D Gaussian splatting framework. Methodologically, it unifies multi-view rendering with 4D Gaussian splatting, incorporates an inconsistency-aware optimization strategy, and employs an asymmetric U-Net–driven video diffusion model to enable fine-grained spatiotemporal control and text/image-guided generation and editing. The core contribution lies in embedding diffusion priors directly into the 4D dynamic geometry representation, thereby jointly resolving the consistency, fidelity, and controllability challenges. Evaluated on public benchmarks, the approach achieves state-of-the-art performance, significantly improving quality and interactivity in digital human reconstruction, AR/VR content synthesis, and instruction-driven editing.
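
The paper's exact parameterization is not given in this summary; as a rough illustration of what a 4D Gaussian splatting representation can look like, the minimal sketch below (all names and the polynomial motion basis are assumptions, not the authors' method) stores canonical 3D Gaussians plus a simple per-Gaussian temporal deformation evaluated at a query time:

```python
import numpy as np

class Gaussians4D:
    """Toy 4D Gaussian container: canonical 3D Gaussians plus a simple
    per-Gaussian polynomial motion basis evaluated at query time t.
    Illustrative only; not the parameterization used in Splat4D."""

    def __init__(self, n: int, n_basis: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.mu = rng.normal(size=(n, 3))            # canonical centers
        self.log_scale = np.zeros((n, 3))            # anisotropic scales (log-space)
        self.quat = np.tile([1.0, 0, 0, 0], (n, 1))  # rotations (w, x, y, z)
        self.opacity = np.full((n, 1), 0.5)
        self.color = rng.uniform(size=(n, 3))
        # Temporal deformation: coefficients of a low-order polynomial in t.
        self.motion = np.zeros((n, n_basis, 3))

    def positions_at(self, t: float) -> np.ndarray:
        """Deform canonical centers to time t: mu + sum_k c_k * t^k."""
        powers = np.array([t ** (k + 1) for k in range(self.motion.shape[1])])
        offset = np.einsum("k,nkd->nd", powers, self.motion)
        return self.mu + offset

g = Gaussians4D(n=1024)
print(g.positions_at(0.5).shape)  # (1024, 3)
```

In the full framework, Gaussians like these would be rasterized to multiple views at each timestep, with the video diffusion model refining the resulting renders.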

πŸ“ Abstract
Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.
Problem

Research questions and friction points this paper aims to address.

Ensuring temporal and spatial consistency in 4D content creation
Preserving intricate details in monocular video-based 4D generation
Incorporating user guidance effectively for high-fidelity 4D outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multi-view rendering for consistency
Uses a video diffusion model for refinement (see the sketch after this list)
Employs an asymmetric U-Net for detail enhancement
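
As a hedged illustration of the inconsistency-identification idea, the sketch below (a minimal toy under assumed conventions; `refine_step` and its threshold `tau` are hypothetical, not the paper's algorithm) down-weights pixels where the current renders disagree strongly with the diffusion-refined frames, so the photometric loss only supervises regions the two sources agree on:

```python
import numpy as np

def refine_step(rendered: np.ndarray, refined: np.ndarray,
                tau: float = 0.1) -> tuple[np.ndarray, np.ndarray]:
    """One inconsistency-aware supervision step (illustrative only).

    rendered: views rendered from the current 4D Gaussians, (V, H, W, 3)
    refined:  the same views after video-diffusion refinement, (V, H, W, 3)
    Returns a per-pixel confidence mask and the masked L1 loss map.
    """
    # Pixels where the render and the diffusion-refined frame disagree
    # strongly are flagged as inconsistent and masked out, so the
    # diffusion prior can correct them instead of the photometric loss
    # locking the disagreement in.
    err = np.abs(rendered - refined).mean(axis=-1, keepdims=True)  # (V, H, W, 1)
    mask = (err < tau).astype(np.float32)                          # 1 = consistent
    loss_map = mask * np.abs(rendered - refined)
    return mask, loss_map

views = np.random.rand(4, 64, 64, 3).astype(np.float32)
refined = np.clip(views + 0.05 * np.random.randn(*views.shape), 0, 1)
mask, loss_map = refine_step(views, refined.astype(np.float32))
print(mask.mean(), loss_map.mean())
```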
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Minghao Yin
The University of Hong Kong, Hong Kong
Yukang Cao
Research Fellow, Nanyang Technological University
3D computer vision
Songyou Peng
Google DeepMind
Computer Vision, Machine Learning
Kai Han
The University of Hong Kong, Hong Kong