🤖 AI Summary
This work addresses 4D spatiotemporal content synthesis from unlabeled, freely captured videos, without manual camera pose annotations or explicit 3D supervision. To overcome the limitations of existing methods, namely their reliance on pose priors and the ambiguity between camera motion and scene dynamics, we propose SEE4D, a "trajectory-to-camera" generative framework. It decouples camera motion from scene dynamics via a fixed bank of virtual cameras and employs a view-conditioned autoregressive video inpainting model for coherent spatiotemporal modeling. A robust geometry prior is learned by denoising realistically synthesized warped training images, while spline traversal over the virtual cameras at inference supports generalization across diverse motion patterns. Evaluated on cross-view video generation and sparse 3D reconstruction, our approach significantly outperforms pose- and trajectory-dependent methods, achieving superior visual quality and stronger generalization to unseen camera trajectories and dynamic scenes.
📝 Abstract
Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
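To make the trajectory-to-camera inference pipeline concrete, the sketch below illustrates the control flow the abstract describes: warp the source frames toward each camera in a fixed virtual-camera bank, let a view-conditional inpainting model fill the holes, and sweep along the camera spline in overlapping temporal windows so each step has bounded cost. This is a minimal illustration under assumed interfaces, not the authors' implementation; all names (`see4d_inference`, `warp_to_view`, `inpaint_window`, the window and overlap sizes) are hypothetical, and the warp and inpainting models are passed in as placeholder callables.

```python
"""Minimal sketch of a SEE4D-style trajectory-to-camera inference loop.

All function names and interfaces are illustrative assumptions, not the
paper's actual code.
"""
from typing import Callable, List, Sequence
import numpy as np

Frame = np.ndarray    # H x W x 3 image
Camera = np.ndarray   # e.g. a 3x4 extrinsic matrix for one virtual view


def interpolate_camera_spline(camera_bank: Sequence[Camera], n_steps: int) -> List[Camera]:
    """Interpolate between the fixed virtual cameras to get a smooth traversal.

    (A real system would interpolate rotations on SO(3); linear blending is
    only a placeholder to keep the sketch short.)
    """
    spline: List[Camera] = []
    for a, b in zip(camera_bank[:-1], camera_bank[1:]):
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            spline.append((1.0 - t) * a + t * b)
    spline.append(camera_bank[-1])
    return spline


def see4d_inference(
    video: List[Frame],
    camera_bank: Sequence[Camera],
    warp_to_view: Callable[[List[Frame], Camera], List[Frame]],
    inpaint_window: Callable[[List[Frame], Camera], List[Frame]],
    window: int = 8,
    overlap: int = 2,
) -> List[List[Frame]]:
    """Autoregressive, windowed traversal of the virtual-camera spline."""
    spline = interpolate_camera_spline(camera_bank, n_steps=4)
    outputs: List[List[Frame]] = []
    for cam in spline:
        # Warp the source frames toward the current virtual camera; the warp
        # leaves holes wherever the target view sees unobserved geometry.
        warped = warp_to_view(video, cam)
        completed: List[Frame] = []
        start = 0
        while start < len(warped):
            # Condition each window on the last `overlap` frames already
            # completed so consecutive windows stay temporally coherent.
            context = completed[-overlap:] if completed else []
            chunk = context + warped[start : start + window]
            filled = inpaint_window(chunk, cam)
            completed.extend(filled[len(context):])
            start += window
        outputs.append(completed)
    return outputs


if __name__ == "__main__":
    # Dummy usage: identity "warp" and "inpainting" on random frames,
    # just to show the control flow runs end to end.
    frames = [np.random.rand(64, 64, 3) for _ in range(20)]
    cams = [np.eye(3, 4) for _ in range(3)]
    views = see4d_inference(
        frames, cams,
        warp_to_view=lambda v, c: [f.copy() for f in v],
        inpaint_window=lambda chunk, c: [f.copy() for f in chunk],
    )
    print(len(views), "virtual views, each with", len(views[0]), "frames")
```

The key design point reflected here is the one the abstract emphasizes: camera control lives entirely in the fixed virtual-camera bank and its spline traversal, while the learned model only has to solve view-conditional inpainting over short, overlapping windows.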