Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dynamic 4D object generation methods suffer from severe inter-view and temporal inconsistency in appearance and motion under sparse input, manifesting as pronounced temporal drift and artifacts. Method: We propose the first two-stage 4D generation framework to integrate motion priors from a Foundation Point Tracker (FPT). It introduces the dense feature-level correspondences provided by the FPT into the 4D generation pipeline, jointly optimizing diffusion-based multi-view video synthesis and hybrid 4D Gaussian splatting reconstruction. Geometry-appearance co-modeling is achieved via Hex-plane encoding and 4D spherical harmonics. Contribution/Results: Our method achieves state-of-the-art performance on multi-view video and 4D generation benchmarks, yielding outputs with strong temporal stability, cross-view consistency, and text controllability. To foster community advancement, we release Sketchfab28, a high-quality, large-scale 4D dataset curated specifically for 4D content generation research.

📝 Abstract
Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling. Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.
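As a concrete illustration of the Stage-One idea, here is a minimal PyTorch sketch of how dense tracker correspondences could supervise intermediate diffusion features. The function name, tensor shapes, and the anchor-frame loss form are our assumptions, not the paper's exact formulation: features sampled along each visible track are penalized for drifting from their first-frame anchor.

```python
import torch
import torch.nn.functional as F

def tracking_feature_loss(feats, tracks, visibility):
    """Hypothetical Stage-One supervision: pull intermediate diffusion
    features together along tracked point trajectories.

    feats:      (T, C, H, W) per-frame intermediate feature maps
    tracks:     (T, N, 2)    tracker point locations in [-1, 1]
                             (grid_sample convention)
    visibility: (T, N)       1 where the tracker marks the point visible
    """
    # Sample a feature vector at each tracked location in each frame.
    # grid_sample expects a grid of shape (B, H_out, W_out, 2).
    grid = tracks.unsqueeze(1)                                 # (T, 1, N, 2)
    sampled = F.grid_sample(feats, grid, align_corners=True)   # (T, C, 1, N)
    sampled = sampled.squeeze(2).permute(0, 2, 1)              # (T, N, C)

    # Use the first frame as the anchor for each track and penalize
    # feature drift in later frames where the point stays visible.
    anchor = sampled[0:1]                                      # (1, N, C)
    drift = (sampled[1:] - anchor).pow(2).sum(-1)              # (T-1, N)
    mask = (visibility[1:] * visibility[0:1]).float()          # (T-1, N)
    return (drift * mask).sum() / mask.sum().clamp(min=1.0)
```

In practice such a loss would be added to the usual pixel- or latent-space diffusion objective, so the tracking prior shapes the features without replacing the generative supervision.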
Problem

Research questions and friction points this paper is trying to address.

Generating dynamic 4D objects from sparse inputs while preserving appearance and motion coherence.
Addressing view discrepancy and temporal drift in 4D generation due to inadequate tracking guidance.
Enhancing cross-view coherence and temporal stability in text-editable 4D asset creation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with diffusion model and tracker
Injecting tracker motion priors into diffusion features
Hybrid 4D Gaussian Splatting with Hex-plane and 4D spherical-harmonic features (see the sketch after this list)
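The hybrid motion encoding can be sketched in the same spirit. The snippet below is a simplified, hypothetical implementation (class name, plane resolution, and MLP head are our assumptions): each Gaussian's normalized (x, y, z, t) coordinate indexes six Hex-plane feature grids, the result is concatenated with co-located diffusion features carrying the Stage-One tracking priors, and a small MLP predicts a per-Gaussian deformation offset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridMotionEncoder(nn.Module):
    """Hypothetical sketch of the Stage-Two hybrid motion encoding."""

    def __init__(self, diff_dim=64, plane_dim=32, hidden=128, res=64):
        super().__init__()
        # Six feature planes over (x,y), (x,z), (y,z), (x,t), (y,t), (z,t).
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(plane_dim, res, res))
             for _ in range(6)]
        )
        self.mlp = nn.Sequential(
            nn.Linear(diff_dim + 6 * plane_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # position offset at time t
        )

    def hexplane_feat(self, xyzt):
        # xyzt: (N, 4) normalized to [-1, 1]; bilinear lookup on each plane.
        pairs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
        feats = []
        for plane, (i, j) in zip(self.planes, pairs):
            grid = xyzt[:, [i, j]].view(1, 1, -1, 2)            # (1, 1, N, 2)
            f = F.grid_sample(plane.unsqueeze(0), grid,
                              align_corners=True)               # (1, C, 1, N)
            feats.append(f.squeeze(0).squeeze(1).t())           # (N, C)
        return torch.cat(feats, dim=-1)                         # (N, 6C)

    def forward(self, diffusion_feat, xyzt):
        # diffusion_feat: (N, diff_dim) co-located features carrying
        # the Stage-One tracking priors.
        h = torch.cat([diffusion_feat, self.hexplane_feat(xyzt)], dim=-1)
        return self.mlp(h)  # (N, 3) offsets for canonical Gaussian centers
```

In the same spirit, 4D spherical harmonics would extend each Gaussian's view-dependent SH color coefficients with a time dimension, letting appearance vary jointly over viewing direction and time rather than over direction alone.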