Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

πŸ“… 2025-02-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In video-to-dynamic-4D-scene generation, static regions dominate the optimization process, leading to blurred dynamic details, texture distortion, and overfitting. To address this, we propose DS4D, the first framework to explicitly decouple dynamic and static features along the temporal dimension. Methodologically, DS4D introduces a dynamic-static feature decoupling (DSFD) module to isolate motion representations; a temporal-spatial similarity fusion (TSSF) module that adaptively aggregates multi-view dynamic information for improved motion modeling; and a Gaussian-based differentiable 4D reconstruction pipeline. Evaluated on real-world scene datasets, DS4D significantly enhances texture fidelity in dynamic regions and inter-frame motion consistency, achieving state-of-the-art performance on video-to-4D generation.

πŸ“ Abstract
Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using the whole information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly when the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions, producing results with blurry textures. We consider that decoupling dynamic and static features to enhance dynamic representations can alleviate this issue. We therefore propose a dynamic-static feature decoupling module (DSFD). Along the temporal axis, it regards the portions of current-frame features that differ significantly from reference-frame features as dynamic features; the remaining parts are static features. We then obtain decoupled features driven by the dynamic features and the current-frame features. Moreover, to further enhance the dynamic representation of decoupled features across different viewpoints and to ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along the spatial axis, it adaptively selects similar information of dynamic regions. Building on the above, we construct a novel approach, DS4D. Experimental results verify that our method achieves state-of-the-art (SOTA) results on video-to-4D generation. In addition, experiments on a real-world scenario dataset demonstrate its effectiveness on 4D scenes. Our code will be publicly available.
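The two modules described above can be sketched in simplified form. This is a minimal illustration, not the paper's implementation: the threshold `tau`, the cosine-similarity weighting, and the function names are all assumptions introduced here. It shows the core idea only: feature positions that differ strongly from the reference frame are treated as dynamic (DSFD), and per-view dynamic features are fused with similarity-derived weights (TSSF).

```python
import numpy as np

def decouple_dynamic_static(curr_feat, ref_feat, tau=0.5):
    """Hypothetical DSFD sketch: positions whose feature difference
    from the reference frame exceeds tau are treated as dynamic;
    the rest are static. curr_feat/ref_feat: (N, C) arrays."""
    diff = np.linalg.norm(curr_feat - ref_feat, axis=-1)  # (N,) per-position magnitude
    dyn_mask = diff > tau                                 # (N,) boolean dynamic mask
    dynamic = curr_feat * dyn_mask[:, None]               # keep dynamic positions
    static = curr_feat * (~dyn_mask)[:, None]             # keep static positions
    return dynamic, static, dyn_mask

def temporal_spatial_fusion(dyn_feats, eps=1e-8):
    """Hypothetical TSSF sketch: fuse per-view dynamic features with
    softmax weights from cosine similarity to the cross-view mean.
    dyn_feats: (V, N, C) array of dynamic features from V viewpoints."""
    mean = dyn_feats.mean(axis=0, keepdims=True)          # (1, N, C) cross-view mean
    num = (dyn_feats * mean).sum(axis=-1)                 # (V, N) dot products
    den = (np.linalg.norm(dyn_feats, axis=-1)
           * np.linalg.norm(mean, axis=-1) + eps)         # (V, N) norm products
    sim = num / den                                       # cosine similarity per view
    w = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)  # softmax over views
    return (w[..., None] * dyn_feats).sum(axis=0)         # (N, C) fused features
```

In this toy form, the dynamic and static parts sum back to the original features; the paper's actual decoupling and fusion operate on learned deep features rather than a hard threshold.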
Problem

Research questions and friction points this paper is trying to address.

Decoupling dynamic and static features
Enhancing dynamic region representation
Improving 4D generation from video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-static feature decoupling
Temporal-spatial similarity fusion
Video-to-4D generation enhancement
Authors
Liying Yang (Macau University of Science and Technology)
Chen Liu (The University of Queensland)
Zhenwei Zhu (Macau University of Science and Technology)
Ajian Liu (Institute of Automation, Chinese Academy of Sciences (CASIA))
Hui Ma (Macau University of Science and Technology)
Jian Nong (Macau University of Science and Technology)
Yanyan Liang (Macau University of Science and Technology)

Topics: 3D Reconstruction · Computer Vision · Deep Learning