SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of generating high-fidelity, explicit 4D mesh animations from monocular videos while minimizing reliance on scarce large-scale 4D mesh supervision. We propose SWiT-4D, a parameter-free sliding-window Transformer that adds spatial-temporal modeling across the frames of arbitrarily long videos while preserving the original single-image forward process. An optimization-based trajectory module, tailored to static-camera monocular videos, recovers global translation. The approach is compatible with any DiT-based image-to-3D generator and can be fine-tuned on a single short (<10 s) video. Evaluated on in-domain zoo-test sets and out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos), it consistently outperforms existing baselines in temporal smoothness.

📝 Abstract

Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
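The abstract does not spell out the attention mechanism, so the following is only a minimal sketch of what a parameter-free sliding window over frames could look like, not the authors' implementation. It assumes each frame contributes a fixed number of tokens and that the frozen image-to-3D projections are reused unchanged; the names `sliding_window_mask`, `attention`, and `radius` are hypothetical. With a window radius of 0 the masked attention reduces exactly to independent per-frame attention, which is one way to read "preserving the original single-image forward process".

```python
import numpy as np


def sliding_window_mask(num_frames, tokens_per_frame, radius):
    """Boolean [N, N] mask: True where a query token may attend to a key token.

    Tokens are laid out frame by frame; a token in frame t may attend to
    tokens in frames t - radius .. t + radius (hypothetical window rule).
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return np.abs(frame_id[:, None] - frame_id[None, :]) <= radius


def attention(q, k, v, mask):
    """Plain scaled dot-product attention with a boolean mask, no new weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
T, L, D = 4, 3, 8  # frames, tokens per frame, channels (toy sizes)
x = rng.standard_normal((T * L, D))

# Stand-ins for the frozen projections of a pretrained single-image model.
w_q, w_k, w_v = (rng.standard_normal((D, D)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Reference: run attention on each frame independently.
full = np.ones((L, L), dtype=bool)
per_frame = np.concatenate([
    attention(q[i * L:(i + 1) * L], k[i * L:(i + 1) * L], v[i * L:(i + 1) * L], full)
    for i in range(T)
])

# Radius 0 reproduces the per-frame result; radius 1 mixes neighbouring frames.
windowed0 = attention(q, k, v, sliding_window_mask(T, L, 0))
windowed1 = attention(q, k, v, sliding_window_mask(T, L, 1))
```

Because the window is expressed purely as a mask over existing attention, the sketch adds no learnable parameters, and the cost per frame depends on the window size rather than on the total video length.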

Problem

Research questions and friction points this paper is trying to address.

Converts monocular videos into high-quality 4D animated meshes

Minimizes reliance on large-scale 4D datasets for training

Enables 4D reconstruction from arbitrary-length videos efficiently

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding-Window Transformer for lossless temporal 4D generation

Integrates with Diffusion Transformer image-to-3D generators

Optimization-based trajectory module for global translation recovery

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency