🤖 AI Summary
Modern recurrent neural networks (RNNs) offer computational efficiency in 3D reconstruction due to their linear time complexity, yet suffer from poor length generalization, degrading on sequences longer than those seen during training. To address this, we formulate 3D reconstruction as a test-time training problem and propose a lightweight, training-free online learning framework. Our method introduces a closed-form learning rate derived from the alignment confidence between the memory state and incoming observations, dynamically balancing retention of historical information against adaptation to new observations. Coupled with a GPU-optimized recurrent architecture and test-time memory update mechanisms, the framework enables low-memory, high-frame-rate inference. On global pose estimation, our approach achieves a 2× accuracy improvement over baselines, operates at 20 FPS using only 6 GB of GPU memory, and robustly handles sequences spanning several thousand frames.
📝 Abstract
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information with adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available at https://rover-xingyu.github.io/TTT3R
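To make the memory-update idea concrete, here is a minimal NumPy sketch of a confidence-gated recurrent state update. It is illustrative only: the cosine-similarity confidence gate, the sigmoid squashing, and the function names (`confidence_lr`, `update_memory`, `beta`) are assumptions for exposition, not the paper's actual closed-form rule. The sketch captures the stated behavior: when the incoming observation aligns well with the memory state, the learning rate shrinks (retain history); when alignment is poor, it grows (adapt to the new observation).

```python
import numpy as np

def confidence_lr(memory, obs_key, beta=8.0):
    """Hypothetical alignment-confidence learning rate.

    Confidence is modeled as cosine similarity between the memory
    state and the incoming observation key, mapped through a sigmoid
    so that high alignment -> small learning rate (retain history)
    and low alignment -> large learning rate (adapt to new input).
    """
    m = memory / (np.linalg.norm(memory) + 1e-8)
    k = obs_key / (np.linalg.norm(obs_key) + 1e-8)
    align = float(m @ k)  # cosine similarity in [-1, 1]
    return 1.0 / (1.0 + np.exp(beta * align))  # in (0, 1)

def update_memory(memory, obs_value, eta):
    """Convex combination of old state and new observation."""
    return (1.0 - eta) * memory + eta * obs_value
```

Because the update is a convex combination with `eta` in (0, 1), the state never jumps fully to the new observation, which is one simple way to trade off stability against plasticity in an online, training-free setting.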