Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

📅 2026-04-09
🤖 AI Summary
This work addresses the limitations of feedforward models in large-scale 3D scene reconstruction from long video sequences, where restricted memory and a lack of global context lead to poor accuracy and temporal inconsistency. To overcome this, the authors propose a lightweight neural subnetwork that builds a global context representation, which is efficiently adapted at test time via self-supervised objectives to compress and preserve long-range scene information. This substantially improves the model's ability to maintain long-term consistency over extremely large scenes without significant computational overhead. Evaluated on large-scale benchmarks including KITTI Odometry and Oxford Spires, the method achieves state-of-the-art performance in both pose estimation accuracy and 3D reconstruction quality while retaining high efficiency.
📝 Abstract
This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans naturally exploit a global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
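The abstract's core idea, test-time adaptation of a compact context representation against a self-supervised reconstruction objective, can be illustrated with a toy sketch. This is not the paper's method: the frozen random decoder `W`, the per-frame context codes `Z`, and the plain gradient loop are all hypothetical stand-ins for the lightweight sub-networks described above; the point is only to show how a compressed representation can be optimized at inference time without labels.

```python
import numpy as np

def adapt_context(features, ctx_dim=8, lr=0.01, steps=200, seed=0):
    """Toy test-time adaptation sketch (hypothetical, NOT the paper's
    architecture): compress a stream of per-frame features into compact
    context codes by minimizing a self-supervised reconstruction loss
    at inference time. The decoder W stays frozen; only the small
    context codes Z are optimized, keeping the adaptation lightweight."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.standard_normal((d, ctx_dim))   # frozen decoder: context -> feature space
    Z = np.zeros((n, ctx_dim))              # per-frame context codes, adapted at test time
    history = []
    for _ in range(steps):
        recon = Z @ W.T                     # decode context back into feature space
        err = recon - features              # self-supervised signal: reconstruction residual
        history.append(float(np.mean(err ** 2)))
        Z -= lr * (err @ W)                 # gradient step on the codes only
    return Z, history
```

Because only the low-dimensional codes are updated against a convex objective, each adaptation step is cheap, mirroring the abstract's claim that memory capacity grows without significant computational overhead.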
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
large-scale scenes
long video sequences
global context
reconstruction consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time training
global context representation
large-scale 3D reconstruction
self-supervised adaptation
neural scene representation