🤖 AI Summary
Existing geometric reconstruction methods struggle to efficiently process minute-long videos due to the quadratic complexity of attention mechanisms or the insufficient memory capacity of recurrent architectures. This work proposes a chunked bidirectional inference framework that integrates a learned hybrid memory mechanism, combining global parametric memory via test-time training (TTT) with local non-parametric memory from sliding window attention (SWA), to achieve high-fidelity reconstruction within chunks while preserving cross-chunk geometric consistency. Notably, the method requires no post-optimization and is the first to generalize from training on 128-frame sequences to inference on videos spanning thousands of frames. It reduces ATE by over 74% on KITTI and achieves robust, globally consistent dense 3D reconstructions on the 19k-frame VBR dataset, significantly outperforming existing feedforward approaches.
📝 Abstract
Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism that preserves uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames and to generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.
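The dual-component memory described above can be sketched in miniature: a parametric matrix memory updated by one gradient step per chunk (the TTT component) next to a bounded buffer of recent, uncompressed chunk features (the SWA component). This is a minimal illustration under assumed dimensions and a simple reconstruction loss; the class name, update rule, and hyperparameters are hypothetical and not from the paper.

```python
import numpy as np

class HybridMemory:
    """Hypothetical sketch of a TTT + SWA hybrid memory.

    - W is a parametric "global" memory, adapted at test time by a
      gradient step on a per-chunk reconstruction loss (TTT-style).
    - window is a non-parametric "local" memory holding the last few
      chunks uncompressed (SWA-style), so adjacent chunks can attend
      to exact recent features.
    All shapes and the loss are illustrative assumptions.
    """

    def __init__(self, dim, window_chunks, lr=0.1):
        self.W = np.zeros((dim, dim))    # parametric global memory
        self.window = []                 # non-parametric local memory
        self.window_chunks = window_chunks
        self.lr = lr

    def update(self, chunk):
        # TTT-style step: minimize L = ||chunk - chunk @ W||^2,
        # so W accumulates a compressed summary of chunk statistics.
        pred = chunk @ self.W
        grad = -2.0 * chunk.T @ (chunk - pred)   # dL/dW
        self.W -= self.lr * grad / len(chunk)

        # SWA-style step: keep only the most recent chunks verbatim.
        self.window.append(chunk)
        if len(self.window) > self.window_chunks:
            self.window.pop(0)

    def context(self):
        # Uncompressed local context exposed to the next chunk.
        return np.concatenate(self.window, axis=0)

mem = HybridMemory(dim=4, window_chunks=2)
rng = np.random.default_rng(0)
for _ in range(5):
    mem.update(rng.standard_normal((8, 4)))
```

The key property mirrored here is the one the abstract claims: the local window stays bounded regardless of sequence length, while the fixed-size parametric memory keeps absorbing global information, which is what lets a model trained on short sequences run over far longer ones.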