🤖 AI Summary
This work addresses the challenge of simultaneously achieving geometric accuracy, temporal consistency, and computational efficiency in streaming 3D reconstruction from video. The authors propose LingBot-Map, a feed-forward 3D foundation model motivated by SLAM principles and built on a geometric context transformer (GCT) architecture. The GCT's attention mechanism combines an anchor context, a pose-reference window, and a trajectory memory, which address coordinate grounding, dense geometric cues, and long-range drift correction, respectively, while keeping the streaming state compact. Across a variety of benchmarks, LingBot-Map outperforms existing streaming and iterative optimization-based methods, and it runs stably at approximately 20 FPS on 518 × 378 inputs over sequences exceeding 10,000 frames.
📝 Abstract
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal
consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation
model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully
designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and
long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around
20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach
achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
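
To make the idea of a compact streaming state concrete, the following is a minimal bookkeeping sketch of a three-part attention context (anchor, recent window, trajectory memory). It is not the authors' implementation: the class name `StreamingContext`, the eviction rule, and every size (one anchor frame, an 8-frame window, 256 dense tokens per frame, one memory token per evicted frame) are assumptions chosen purely for illustration, since the abstract does not specify them.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical bookkeeping for a three-part streaming attention context.

    All sizes below are illustrative assumptions, not values from the paper.
    """
    num_anchor_frames: int = 1        # assumed: first frame(s) fix the coordinate frame
    window_size: int = 8              # assumed: recent frames kept with dense tokens
    tokens_per_frame: int = 256       # assumed: dense tokens per retained frame
    memory_tokens_per_frame: int = 1  # assumed: compact summary per evicted frame
    anchor: list = field(default_factory=list)
    window: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def push(self, frame_id: int) -> None:
        """Add a frame: fill the anchor first, then slide the window."""
        if len(self.anchor) < self.num_anchor_frames:
            self.anchor.append(frame_id)
            return
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            # The oldest window frame is demoted to a compact memory entry.
            self.memory.append(self.window.popleft())

    def context_tokens(self) -> int:
        """Number of tokens the next frame would attend to."""
        dense = (len(self.anchor) + len(self.window)) * self.tokens_per_frame
        return dense + len(self.memory) * self.memory_tokens_per_frame


def simulate(num_frames: int) -> StreamingContext:
    """Push `num_frames` consecutive frame ids through a default context."""
    ctx = StreamingContext()
    for frame_id in range(num_frames):
        ctx.push(frame_id)
    return ctx


if __name__ == "__main__":
    ctx = simulate(10_000)
    full_cache = 10_000 * ctx.tokens_per_frame  # keeping every frame's dense tokens
    print(ctx.context_tokens(), "tokens in context vs", full_cache, "for a full cache")
    print("stream duration at 20 FPS:", 10_000 / 20, "seconds")
```

Under these assumed sizes, the context after 10,000 frames holds 12,295 tokens, compared with 2,560,000 if every frame's dense tokens were kept. At the reported rate of about 20 FPS, a 10,000-frame sequence corresponds to roughly 500 seconds of video. Whether the actual trajectory memory grows per frame, as in this sketch, or has a fixed size is not stated in the abstract.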