Large-scale visual SLAM for in-the-wild videos

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Camera tracking failures and reconstruction discontinuities frequently occur in unconstrained monocular videos captured in the wild, primarily due to rapid camera rotation, pure forward motion, texture scarcity, and dynamic objects. Method: We propose a highly robust end-to-end monocular SLAM framework that integrates self-calibrating intrinsic estimation, learning-based dynamic object masking, depth-regularized bundle adjustment (BA), and bag-of-words–driven global loop closure optimization. Stability in low-parallax and texture-poor scenes is further enhanced via SfM-based initialization, monocular depth priors, and consistency-aware BA. Results: Our method achieves large-scale, temporally coherent, and geometrically consistent 3D reconstructions on multiple long-duration real-world videos. Compared to state-of-the-art baselines, it largely eliminates local fragmentation and geometric distortion, and consistently outperforms them in NeRF re-rendering quality, map completeness, and runtime.

📝 Abstract
Accurate and robust 3D scene reconstruction from casual, in-the-wild videos can significantly simplify robot deployment to new environments. However, reliable camera pose estimation and scene reconstruction from such unconstrained videos remain an open challenge. Existing visual-only SLAM methods perform well on benchmark datasets but struggle with real-world footage, which often exhibits uncontrolled motion (including rapid rotations and pure forward movement), textureless regions, and dynamic objects. We analyze the limitations of current methods and introduce a robust pipeline designed to improve 3D reconstruction from casual videos. We build upon recent deep visual odometry methods but increase robustness in several ways. Camera intrinsics are automatically recovered from the first few frames using structure-from-motion. Dynamic objects and less-constrained areas are masked with a predictive model. Additionally, we leverage monocular depth estimates to regularize bundle adjustment, mitigating errors in low-parallax situations. Finally, we integrate place recognition and loop closure to reduce long-term drift, and refine both intrinsics and pose estimates through global bundle adjustment. We demonstrate large-scale contiguous 3D models from several online videos in various environments. In contrast, baseline methods typically produce locally inconsistent results at several points, yielding separate segments or distorted maps. In lieu of ground-truth pose data, we evaluate map consistency, execution time, and visual accuracy of re-rendered NeRF models. Our proposed system establishes a new baseline for visual reconstruction from casual uncontrolled videos found online, demonstrating more consistent reconstructions over longer sequences of in-the-wild videos than previously achieved.
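The depth-regularized bundle adjustment described in the abstract can be sketched as a residual function that augments standard pinhole reprojection error with a penalty tying estimated point depths to monocular depth predictions. This is an illustrative sketch, not the paper's implementation: the function name, the log-depth form of the prior, and the weight `lam` are assumptions.

```python
import numpy as np

def ba_residuals(points_cam, observations, prior_depths, fx, fy, cx, cy, lam=0.5):
    """Residuals for one camera: pinhole reprojection error plus a
    monocular-depth prior term (illustrative; `lam` weights the prior,
    and the exact regularizer used in the paper may differ).

    points_cam:   (N, 3) 3D points in the camera frame
    observations: (N, 2) observed pixel coordinates
    prior_depths: (N,)   depths predicted by a monocular depth network
    """
    z = points_cam[:, 2]
    # Project points through the pinhole model with intrinsics (fx, fy, cx, cy).
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    reproj = np.stack([u, v], axis=1) - observations  # (N, 2) pixel errors
    # Log-depth difference keeps the prior term bounded across depth scales;
    # in low-parallax segments it anchors otherwise poorly constrained depths.
    depth_reg = lam * (np.log(z) - np.log(prior_depths))
    return np.concatenate([reproj.ravel(), depth_reg])
```

A residual vector like this would typically be minimized with a nonlinear least-squares solver (e.g. Levenberg–Marquardt) jointly over poses, points, and, after loop closure, intrinsics.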
Problem

Research questions and friction points this paper is trying to address.

Robust 3D reconstruction from uncontrolled in-the-wild videos
Overcoming challenges in camera pose estimation for dynamic scenes
Improving visual SLAM accuracy in textureless and low-parallax regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically recovers camera intrinsics using structure-from-motion
Masks dynamic objects with predictive model
Regularizes bundle adjustment with monocular depth estimates
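The dynamic-object masking step above can be illustrated as a simple filter that discards tracked keypoints falling on regions a predictive model has flagged as dynamic, so they never enter bundle adjustment. This is a minimal sketch under assumptions: the paper uses a learned predictive model, whereas here the mask is just a boolean array whose source is left open.

```python
import numpy as np

def mask_dynamic_features(keypoints, dynamic_mask):
    """Drop keypoints that fall on predicted dynamic objects before
    tracking/BA. `dynamic_mask` is an H x W boolean array (True = dynamic,
    e.g. people or vehicles); how it is predicted is outside this sketch.

    keypoints: (N, 2) array of (x, y) pixel coordinates
    returns:   (M, 2) keypoints lying on static regions only
    """
    # Clamp coordinates so lookups stay inside the mask bounds.
    xs = np.clip(keypoints[:, 0].astype(int), 0, dynamic_mask.shape[1] - 1)
    ys = np.clip(keypoints[:, 1].astype(int), 0, dynamic_mask.shape[0] - 1)
    keep = ~dynamic_mask[ys, xs]
    return keypoints[keep]
```

Filtering features this way keeps moving objects from contaminating pose estimates, which is one of the failure modes in dynamic in-the-wild scenes.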