Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the memory blow-up and quality degradation that VGGT suffers in long-sequence 3D reconstruction, both of which stem from the quadratic complexity of global attention. It proposes a streaming reconstruction approach that recasts context construction as a retrieval task: at each step, a fixed number of relevant history frames is selected using query-key similarities from the first global attention layer, with no additional learned scoring mechanism. A segment-wise sampling strategy increases the diversity of the retrieved frames, and a pose-aware spatial memory mechanism enables location-sensitive retrieval. The design keeps memory consumption constant while outperforming StreamVGGT, TTT3R, and InfiniteVGGT, achieving state-of-the-art performance in long-sequence 3D reconstruction.
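The retrieval step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain-list token representation, and the choice to average scaled dot products over all query-key token pairs are assumptions made for illustration only. The paper states that first-layer query-key similarity indicates relevance, but this page does not say how token-level similarities are aggregated into a frame score.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def frame_relevance(queries: List[Vector], frame_keys: List[Vector]) -> float:
    """Score one history frame against the current frame.

    Assumption: the frame score is the mean scaled dot product over all
    (query token, key token) pairs. The aggregation rule is hypothetical.
    """
    scale = len(queries[0]) ** -0.5  # usual 1/sqrt(d) attention scaling
    total = sum(dot(q, k) for q in queries for k in frame_keys)
    return scale * total / (len(queries) * len(frame_keys))


def retrieve_top_k(
    queries: List[Vector], key_cache: List[List[Vector]], k: int
) -> List[int]:
    """Return the indices of the k highest-scoring history frames.

    key_cache[i] holds the cached key tokens of history frame i. The result
    is sorted by frame index so the retrieved context keeps temporal order.
    """
    scores = [frame_relevance(queries, keys) for keys in key_cache]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: one query token, four history frames with one key token each.
current_queries = [[1.0, 0.0]]
history_keys = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.5, 0.0]], [[-1.0, 0.0]]]
selected = retrieve_top_k(current_queries, history_keys, k=2)
```

Because `k` is fixed, the context passed to the model has the same size at every step regardless of how many frames have been cached, which is the property the summary describes as constant memory consumption.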
📝 Abstract
Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via a scalable Transformer architecture, but the quadratic complexity of its global attention prevents long-context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with the number of frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework that formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget close to its training context length. Interestingly, we find that the similarity between current-frame queries and cached history-frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity, much as a recommender system does, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We also design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
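The two ideas in the abstract that go beyond plain top-k retrieval, Segment Sampling and the pose-aware spatial memory, can be sketched as follows. This is a hedged illustration, not the released code: the fixed-length segments, the one-frame-per-segment rule, the uniform grid over camera positions, and every name below (`segment_sample`, `pose_cell`, `build_spatial_memory`, `segment_len`, `cell_size`) are assumptions, since the abstract does not specify these details.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, ...]


def segment_sample(scores: List[float], segment_len: int, budget: int) -> List[int]:
    """Pick at most `budget` frames, at most one per contiguous segment.

    Assumption: history is cut into fixed-length segments and each segment
    contributes only its best-scoring frame, so the selection cannot collapse
    onto a single high-similarity region.
    """
    best_per_segment = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        best_per_segment.append(max(segment, key=lambda i: scores[i]))
    best_per_segment.sort(key=lambda i: scores[i], reverse=True)
    return sorted(best_per_segment[:budget])


def pose_cell(position: Tuple[float, float, float], cell_size: float) -> Cell:
    """Quantize a camera position to a grid cell (hypothetical layout)."""
    return tuple(math.floor(c / cell_size) for c in position)


def build_spatial_memory(
    positions: List[Tuple[float, float, float]], cell_size: float
) -> Dict[Cell, List[int]]:
    """Group history frame indices by the grid cell of their estimated pose."""
    memory: Dict[Cell, List[int]] = defaultdict(list)
    for frame_idx, position in enumerate(positions):
        memory[pose_cell(position, cell_size)].append(frame_idx)
    return dict(memory)


# Plain top-2 on these scores would return the neighbouring frames 0 and 1;
# segment sampling spreads the two picks over different segments instead.
relevance = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]
picked = segment_sample(relevance, segment_len=2, budget=2)

camera_positions = [(0.1, 0.0, 0.0), (0.4, 0.2, 0.0), (2.5, 0.0, 0.0)]
spatial_memory = build_spatial_memory(camera_positions, cell_size=1.0)
```

In this sketch the spatial memory only groups frames by location. How RetrieveVGGT combines such groups with the similarity scores during retrieval is not described on this page, so that step is left out.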
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
long context
memory efficiency
streaming
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
query-key similarity retrieval
segment sampling
pose-aware spatial memory
long context 3D reconstruction
🔎 Similar Papers