MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-fidelity, temporally coherent dynamic 4D content from a single input image. The authors propose a novel framework integrating multi-view synthesis with deformable 4D Gaussian Splatting (4D GS). The core innovation is a lightweight image-matrix module that jointly models spatiotemporal consistency, enabling smooth temporal deformation of 3D Gaussian point clouds and effectively mitigating motion discontinuities and background degradation. To enhance geometric accuracy and visual realism, the framework introduces CLIP-guided semantic constraints alongside joint PSNR/FVD optimization. Evaluated on the Objaverse dataset, the method achieves state-of-the-art performance across CLIP-I, PSNR, and FVD metrics; it significantly reduces flickering artifacts, preserves fine structural details, and improves inference efficiency compared to existing approaches.
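The summary reports PSNR among the evaluation metrics. As a point of reference (this is not the authors' code), per-frame PSNR between a rendered view and its ground truth is conventionally computed from the mean squared error, as in this minimal NumPy sketch:

```python
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between a rendered frame and ground truth.

    Both arrays are float images in [0, max_val] with identical shapes.
    """
    mse = np.mean((rendered.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy check: a lightly perturbed frame scores a finite PSNR (around 40 dB
# for noise with std 0.01), while an identical frame scores infinity.
rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))
noisy = np.clip(frame + 0.01 * rng.standard_normal(frame.shape), 0.0, 1.0)
print(psnr(frame, frame))  # inf
print(psnr(frame, noisy))
```

Higher PSNR means the rendered frame is closer to the reference; FVD and CLIP-I complement it by measuring temporal and semantic fidelity, respectively.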

📝 Abstract
Advances in generative modeling have significantly enhanced digital content creation, extending from 2D images to complex 3D and 4D scenes. Despite substantial progress, producing high-fidelity and temporally consistent dynamic 4D content remains a challenge. In this paper, we propose MVG4D, a novel framework that generates dynamic 4D content from a single still image by combining multi-view synthesis with 4D Gaussian Splatting (4D GS). At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images, providing rich supervisory signals for downstream 3D and 4D reconstruction. These multi-view images are used to optimize a 3D Gaussian point cloud, which is further extended into the temporal domain via a lightweight deformation network. Our method effectively enhances temporal consistency, geometric fidelity, and visual realism, addressing key challenges in motion discontinuity and background degradation that affect prior 4D GS-based methods. Extensive experiments on the Objaverse dataset demonstrate that MVG4D outperforms state-of-the-art baselines in CLIP-I, PSNR, FVD, and time efficiency. Notably, it reduces flickering artifacts and sharpens structural details across views and time, enabling more immersive AR/VR experiences. MVG4D sets a new direction for efficient and controllable 4D generation from minimal inputs.
Problem

Research questions and friction points this paper is trying to address.

Generating dynamic 4D content from a single image
Enhancing temporal consistency and visual realism
Addressing motion discontinuity and background degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an image matrix module for multi-view synthesis
Combines 4D Gaussian Splatting with a lightweight deformation network
Enhances temporal consistency and visual realism
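The core idea behind deformable 4D GS is to keep a canonical set of 3D Gaussians and let a small network predict time-dependent offsets for them. The paper's actual architecture is not specified here, so the following is only a toy sketch of that general pattern: a hypothetical two-hidden-layer MLP taking each Gaussian center plus a timestamp and returning a position offset.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_mlp(sizes):
    """Randomly initialized small MLP; sizes like [4, 64, 64, 3]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def deform(params, xyz, t):
    """Predict per-Gaussian position offsets at time t (toy illustration).

    xyz: (N, 3) canonical Gaussian centers; t: scalar in [0, 1].
    The input is (x, y, z, t); the output offset is added to xyz.
    """
    h = np.concatenate([xyz, np.full((len(xyz), 1), t)], axis=1)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return xyz + h

params = init_mlp([4, 64, 64, 3])          # untrained weights, for shape only
centers = rng.standard_normal((1000, 3))   # canonical 3D Gaussian centers
moved = deform(params, centers, t=0.5)     # deformed centers at t = 0.5
print(moved.shape)  # (1000, 3)
```

In an actual pipeline the deformation network would be trained against the multi-view supervision, and would typically also predict changes to the Gaussians' rotations and scales, not just their positions.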
Xiaotian Chen
Shenzhen University
Dongfu Yin
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen)
Fei Richard Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen)
Xuanchen Li
Shanghai Jiao Tong University
Digital Human · Humanoid Robot · AIGC · Computer Vision · Image Restoration
Xinhao Zhang
PhD student, Portland State University
Data Mining · Reinforcement Learning