🤖 AI Summary
This work addresses the challenge of generating high-fidelity, temporally coherent dynamic 4D content from a single input image. We propose MVG4D, a framework that integrates multi-view synthesis with deformable 4D Gaussian Splatting (4D GS). Its core component is an image matrix module that synthesizes temporally coherent, spatially diverse multi-view images. These images supervise a 3D Gaussian point cloud, which a lightweight deformation network extends over time, mitigating motion discontinuities and background degradation. To enhance geometric accuracy and visual realism, we introduce CLIP-guided semantic constraints alongside joint PSNR/FVD optimization. Evaluated on the Objaverse dataset, our method outperforms state-of-the-art baselines on CLIP-I, PSNR, and FVD. It reduces flickering artifacts, preserves fine structural details, and improves inference efficiency compared to existing approaches.
📝 Abstract
Advances in generative modeling have significantly enhanced digital content creation, extending from 2D images to complex 3D and 4D scenes. Despite substantial progress, producing high-fidelity and temporally consistent dynamic 4D content remains a challenge. In this paper, we propose MVG4D, a novel framework that generates dynamic 4D content from a single still image by combining multi-view synthesis with 4D Gaussian Splatting (4D GS). At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images, providing rich supervisory signals for downstream 3D and 4D reconstruction. These multi-view images are used to optimize a 3D Gaussian point cloud, which is then extended into the temporal domain by a lightweight deformation network. Our method improves temporal consistency, geometric fidelity, and visual realism, addressing the motion discontinuity and background degradation that affect prior 4D GS-based methods. Extensive experiments on the Objaverse dataset demonstrate that MVG4D outperforms state-of-the-art baselines in CLIP-I, PSNR, FVD, and time efficiency. Notably, it reduces flickering artifacts and sharpens structural details across views and time, enabling more immersive AR/VR experiences. MVG4D points toward efficient and controllable 4D generation from minimal inputs.
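To make the idea of extending a static 3D Gaussian point cloud into the temporal domain more concrete, the following is a minimal sketch of a deformation network in NumPy. It is not the MVG4D implementation: the class name `DeformationMLP`, the sinusoidal `time_encoding`, the layer sizes, and the zero-initialized output layer are all assumptions chosen for illustration. Only Gaussian centers are deformed here, and no training loop is shown.

```python
import numpy as np


def time_encoding(t, num_freqs=4):
    """Sinusoidal encoding of a scalar time t (assumed to lie in [0, 1])."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])


class DeformationMLP:
    """Hypothetical two-layer MLP mapping (canonical center, time) to an offset."""

    def __init__(self, num_freqs=4, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 3 + 2 * num_freqs  # xyz plus sin/cos time features
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0.0, 0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Zero-initialized output layer: the deformation starts as the identity,
        # so the canonical (static) Gaussians are reproduced at every time step.
        self.w2 = np.zeros((hidden, 3))
        self.b2 = np.zeros(3)

    def offsets(self, centers, t):
        """Predict one 3D offset per Gaussian center at time t."""
        enc = time_encoding(t, self.num_freqs)
        enc = np.broadcast_to(enc, (centers.shape[0], enc.shape[0]))
        x = np.concatenate([centers, enc], axis=1)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU
        return h @ self.w2 + self.b2

    def deform(self, centers, t):
        """Return the time-dependent centers: canonical centers plus offsets."""
        return centers + self.offsets(centers, t)
```

In this sketch, calling `DeformationMLP().deform(centers, t)` with an `(N, 3)` array returns an `(N, 3)` array of displaced centers. Because every frame is produced from the same canonical point cloud by a smooth function of `t`, motion varies continuously between frames, which is the usual motivation for deformation-based 4D Gaussian representations.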