🤖 AI Summary
This work addresses the generation of high-fidelity, explicit 4D mesh animations from monocular videos, a task held back by the scarcity of large-scale 4D mesh data for supervision. We propose SWiT-4D, a parameter-free sliding-window Transformer that adds spatial-temporal modeling across frames and handles videos of arbitrary length. It is compatible with any DiT-based image-to-3D generator and preserves that generator's original single-image forward process. An optimization-based trajectory module recovers global translation for static-camera monocular videos. The approach is data-efficient: fine-tuning on a single short (<10 s) video is enough to reach high-fidelity geometry and stable temporal consistency. Evaluated on in-domain test sets and on out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos), it consistently outperforms existing baselines in temporal smoothness.
📝 Abstract
Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
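To make the "parameter-free" and "lossless" claims concrete, here is a minimal NumPy sketch of sliding-window attention across frames. It is not the authors' implementation. The abstract does not spell out the mechanism, so the sketch assumes that temporal modeling works by letting each frame's tokens attend to the tokens of neighboring frames inside a window, reusing the same query, key, and value projections. The function name `sliding_window_attention`, the `window` parameter, and the `(T, N, D)` tensor layout are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sliding_window_attention(q, k, v, window):
    """Attention in which frame t attends to frames [t - window, t + window].

    q, k, v: arrays of shape (T, N, D) = (frames, tokens per frame, channels).
    They are assumed to come from the pretrained single-image projections,
    so this function introduces no new learnable parameters (an assumption
    of this sketch, not a detail confirmed by the abstract).
    Returns an array of shape (T, N, D).
    """
    T, N, D = q.shape
    out = np.empty_like(q)
    for t in range(T):
        # Clamp the window at the start and end of the video.
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Flatten the tokens of all frames in the window into one key/value set.
        k_win = k[lo:hi].reshape(-1, D)
        v_win = v[lo:hi].reshape(-1, D)
        scores = q[t] @ k_win.T / np.sqrt(D)
        out[t] = softmax(scores) @ v_win
    return out


# Hypothetical usage: 5 frames, 4 tokens per frame, 8 channels.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 5, 4, 8))
per_image = sliding_window_attention(q, k, v, window=0)  # plain per-frame attention
temporal = sliding_window_attention(q, k, v, window=1)   # each frame also sees its neighbors
```

Under these assumptions, `window=0` reduces exactly to ordinary per-frame self-attention, which is one way to read "preserving the original single-image forward process". Because the window slides with `t`, the cost per frame depends on the window size rather than on the video length, which is consistent with handling videos of arbitrary length.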