SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenging problem of generating high-fidelity explicit 4D mesh animations from monocular videos, overcoming the scarcity of large-scale 4D mesh supervision data. It proposes the first parameter-free, sliding-window 4D Transformer architecture, enabling joint spatiotemporal modeling for arbitrarily long videos. By integrating implicit geometric optimization with motion-trajectory estimation, the method recovers global translation without additional parameters or 4D supervision, significantly improving geometric temporal consistency under static-camera settings. The approach is compatible with any DiT-based image-to-3D generator and requires only a single short (<10 s) video for fine-tuning. Evaluated on C4D, Objaverse, and in-the-wild videos, it achieves state-of-the-art temporal smoothness and high-fidelity 4D reconstruction.

📝 Abstract
Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
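The core architectural idea, sliding-window temporal attention that reuses a frozen per-frame generator's projections without adding parameters, can be sketched as a toy. This is an illustrative NumPy sketch, not the authors' implementation: the function name, window size, and shared projection matrices are assumptions; the point is that each frame's tokens attend to tokens from neighboring frames through the same weights used for single-image generation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(tokens, w_q, w_k, w_v, window=2):
    """tokens: (T, N, D) per-frame latent tokens.

    Each frame's queries attend to keys/values gathered from frames within
    +/- `window`, reusing the same (frozen) projection matrices for every
    frame -- so temporal modeling adds no new parameters, and videos of
    arbitrary length are processed by sliding the window along time.
    """
    T, N, D = tokens.shape
    out = np.empty_like(tokens)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        q = tokens[t] @ w_q                    # (N, D) queries for frame t
        ctx = tokens[lo:hi].reshape(-1, D)     # (W*N, D) windowed context
        k, v = ctx @ w_k, ctx @ w_v
        attn = softmax(q @ k.T / np.sqrt(D))   # (N, W*N) attention weights
        out[t] = attn @ v
    return out
```

With `window=0` this collapses back to independent per-frame attention (the original single-image forward process), while a window covering all frames recovers full joint attention; the sliding window trades between the two at fixed cost per frame.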
Problem

Research questions and friction points this paper is trying to address.

Converting monocular videos into high-quality animated 4D meshes
Training video-to-4D models despite the scarcity of large-scale 4D mesh datasets
Reconstructing 4D content efficiently from videos of arbitrary length
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding-Window Transformer for lossless temporal 4D generation
Integrates with Diffusion Transformer image-to-3D generators
Optimization-based trajectory module for global translation recovery
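The page does not detail the trajectory module, so the following is only a generic least-squares sketch of what "optimization-based global translation recovery under a static camera" can look like: fit a per-frame translation so the object's projected centroid matches observed 2D centroids under a fixed pinhole camera, with a temporal smoothness penalty. The function name, loss form, and every parameter here are assumptions for illustration, not the paper's method.

```python
import numpy as np

def recover_translation(obs_2d, depth0, f=500.0, cx=256.0, cy=256.0,
                        lam=0.1, iters=2000, lr=1e-5):
    """Recover per-frame global translation t = (t_x, t_y, t_z), shape (T, 3),
    by gradient descent on a reprojection loss against observed 2D centroids
    `obs_2d` (T, 2) under a static pinhole camera (f, cx, cy), plus a
    temporal smoothness term weighted by `lam`. `depth0` initializes depth;
    depth is only observable when the object moves off the optical axis.
    """
    T = obs_2d.shape[0]
    t = np.zeros((T, 3))
    t[:, 2] = depth0
    for _ in range(iters):
        # pinhole projection of each frame's translation
        u = f * t[:, 0] / t[:, 2] + cx
        v = f * t[:, 1] / t[:, 2] + cy
        ru, rv = u - obs_2d[:, 0], v - obs_2d[:, 1]   # pixel residuals
        # gradient of 0.5*(ru^2 + rv^2) w.r.t. (t_x, t_y, t_z)
        g = np.zeros_like(t)
        g[:, 0] = ru * f / t[:, 2]
        g[:, 1] = rv * f / t[:, 2]
        g[:, 2] = -(ru * f * t[:, 0] + rv * f * t[:, 1]) / t[:, 2] ** 2
        # gradient of lam * sum ||t_{k+1} - t_k||^2 (temporal smoothness)
        d = t[1:] - t[:-1]
        g[1:] += 2.0 * lam * d
        g[:-1] -= 2.0 * lam * d
        t -= lr * g
    return t
```

The design point this illustrates is that translation can be recovered as a small post-hoc optimization over T vectors, rather than by training new network parameters, which is why no extra 4D supervision is needed.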
Kehong Gong
National University of Singapore
digital human, deep learning
Zhengyu Wen
Huawei Central Media Technology Institute
Mingxi Xu
Huawei Central Media Technology Institute
Weixia He
Huawei Central Media Technology Institute
Qi Wang
Huawei Central Media Technology Institute
Ning Zhang
Huawei Central Media Technology Institute
Zhengyu Li
Peking University
Quantum Cryptography
Chenbin Li
Huawei Central Media Technology Institute
Dongze Lian
Huawei Central Media Technology Institute
Wei Zhao
Huawei Central Media Technology Institute
Xiaoyu He
Huawei Central Media Technology Institute
Mingyuan Zhang
Huawei Central Media Technology Institute