🤖 AI Summary
Existing methods for dynamic 4D asset reconstruction either rely on category-specific priors or incur prohibitive optimization latency. To address this, we propose the first Transformer-based, feed-forward framework for implicit temporal interpolation. Our method enforces causal consistency via a dedicated loss, jointly optimizing triplane feature interpolation and implicit neural representations to synthesize high-fidelity, deformable geometry and UV-consistent textured mesh sequences at arbitrary continuous time steps from sparse keyframes. We further integrate a diffusion model to enhance multi-view consistency. The framework enables end-to-end monocular video reconstruction with inference in seconds. Evaluated on multiple dynamic datasets, it significantly outperforms FiLM and linear-interpolation baselines. To our knowledge, this is the first approach to combine cross-category generalization, high fidelity, and production-ready speed in 4D reconstruction.
📝 Abstract
Reconstructing dynamic assets from video data is central to many tasks in computer vision and graphics. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t \in [t_0, t_1]$, delivering high-quality interpolated frames in seconds. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
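For context, the "direct triplane linear interpolation" baseline that the abstract compares against can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the tensor shapes, the function name, and the blending weight are assumptions; LIM replaces this fixed linear blend with a learned transformer that predicts the interpolated representation.

```python
import numpy as np

def lerp_triplane(tri0: np.ndarray, tri1: np.ndarray,
                  t0: float, t1: float, t: float) -> np.ndarray:
    """Linearly blend two triplane feature grids at times t0 and t1.

    tri0, tri1: hypothetical triplane features of shape (3, C, H, W)
    (three axis-aligned feature planes, C channels, HxW resolution).
    Returns the blended triplane at continuous time t in [t0, t1].
    """
    assert t0 <= t <= t1 and t1 > t0
    w = (t - t0) / (t1 - t0)          # normalized interpolation weight
    return (1.0 - w) * tri0 + w * tri1

# Toy keyframes: 3 planes, 4 channels, 8x8 resolution.
tri0 = np.zeros((3, 4, 8, 8))
tri1 = np.ones((3, 4, 8, 8))
mid = lerp_triplane(tri0, tri1, t0=0.0, t1=1.0, t=0.5)
print(mid.mean())  # 0.5
```

A fixed blend like this cannot represent non-linear deformation between keyframes, which is the limitation a learned feed-forward interpolator is meant to address.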