🤖 AI Summary
This work addresses the challenging problem of novel view synthesis (NVS) for long, unstructured videos captured by non-professional users, which exhibit irregular camera motion, unknown camera poses, and large-scale, complex scenes. We propose a calibration-free, incremental joint optimization framework that: (1) simultaneously optimizes camera poses and the 3D Gaussian Splatting representation; (2) incorporates a learned 3D geometric prior for robust pose initialization; and (3) introduces a spatial-density-aware, octree-based anchor construction mechanism for efficient organization and rendering of massive point clouds. Evaluated on multiple challenging long-video benchmarks, the method achieves state-of-the-art rendering quality, pose accuracy, and computational efficiency. To the best of our knowledge, it is the first approach to enable high-fidelity, long-duration NVS without any auxiliary information (e.g., IMU data, depth sensors, or pre-calibrated cameras).
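The spatial-density-aware octree anchor idea can be illustrated with a minimal sketch: voxels containing more points than a density threshold are recursively subdivided, and an anchor is placed at the center of each leaf voxel, so dense regions receive more anchors than sparse ones. All names here (`octree_anchors`, `density_thresh`, `base_size`) and the specific splitting rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def octree_anchors(points, base_size=1.0, levels=3, density_thresh=32):
    """Hypothetical sketch of density-aware octree anchor formation:
    subdivide a voxel while it holds more than density_thresh points
    (up to `levels` octree levels), then emit the leaf voxel center."""
    anchors = []

    def refine(pts, origin, size, level):
        if len(pts) == 0:
            return
        # Stop at max depth or once the voxel is sparse enough.
        if level == levels - 1 or len(pts) <= density_thresh:
            anchors.append(origin + size / 2.0)  # anchor at voxel center
            return
        half = size / 2.0
        # Split the voxel into 8 children and recurse.
        for ix in range(2):
            for iy in range(2):
                for iz in range(2):
                    o = origin + half * np.array([ix, iy, iz])
                    mask = np.all((pts >= o) & (pts < o + half), axis=1)
                    refine(pts[mask], o, half, level + 1)

    # Root-level voxel grid over the point cloud's bounding box.
    mins = points.min(axis=0)
    grid = np.floor((points - mins) / base_size).astype(int)
    for cell in np.unique(grid, axis=0):
        origin = mins + cell * base_size
        mask = np.all(grid == cell, axis=1)
        refine(points[mask], origin, base_size, 0)
    return np.array(anchors)
```

In this toy version the anchor count adapts to local density: a cluster of points triggers subdivision and yields several fine-grained anchors, while an isolated region collapses into a single coarse anchor, which is the intuition behind converting a massive dense point cloud into a compact anchor set.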
📝 Abstract
LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/