FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unbounded growth of the KV cache in streaming visual geometry Transformers over long sequences, and the incoherent geometric support that results under fixed memory constraints. To this end, the authors propose a frame-driven rolling explicit-memory framework that treats each frame's incremental KV states as a coherent evidence block. These blocks are summarized through prototype-based compression and stored in a fixed-capacity mid-term memory bank, while a lightweight anchor mechanism mitigates long-term degradation. By reframing bounded-memory strategies from a geometric-support perspective, the approach replaces conventional token-level retention with frame-level evidence blocks. Experiments on long-sequence 3D reconstruction, video depth estimation, and camera pose estimation demonstrate significant improvements in the accuracy–memory trade-off, as well as enhanced long-term geometric consistency.

📝 Abstract
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy–memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.
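The abstract's core mechanism can be sketched as a small data structure: each frame's KV tokens enter as one block, each block is reduced to a compact prototype, and a fixed-capacity bank evicts the most redundant stored block when full. The sketch below is illustrative only (the class, the mean-pooled prototype, and the cosine-redundancy eviction rule are assumptions, not the paper's actual design, which uses learned summaries and an additional anchor tier).

```python
import numpy as np

class RollingFrameMemory:
    """Hypothetical sketch of a frame-level rolling KV memory.

    Each incoming frame contributes a block of key/value tokens, summarized
    into a single prototype vector. When the fixed-capacity bank is full, the
    stored block whose prototype is most similar to the newcomer is evicted,
    keeping the retained blocks complementary under a strict budget.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = []      # (keys, values) per retained frame
        self.prototypes = []  # one unit-norm summary vector per block

    @staticmethod
    def _prototype(keys: np.ndarray) -> np.ndarray:
        # Mean-pooled prototype (a stand-in for the paper's compression).
        p = keys.mean(axis=0)
        return p / (np.linalg.norm(p) + 1e-8)

    def insert(self, keys: np.ndarray, values: np.ndarray) -> None:
        proto = self._prototype(keys)
        if len(self.blocks) >= self.capacity:
            # Evict the stored block most redundant with the new one.
            sims = [float(proto @ q) for q in self.prototypes]
            evict = int(np.argmax(sims))
            self.blocks.pop(evict)
            self.prototypes.pop(evict)
        self.blocks.append((keys, values))
        self.prototypes.append(proto)

    def memory_tokens(self) -> int:
        return sum(k.shape[0] for k, _ in self.blocks)


# Usage: stream 10 frames of 64 KV tokens through a bank of capacity 4.
rng = np.random.default_rng(0)
bank = RollingFrameMemory(capacity=4)
for _ in range(10):
    k = rng.standard_normal((64, 32)).astype(np.float32)
    bank.insert(k, k.copy())
print(len(bank.blocks), bank.memory_tokens())  # → 4 256
```

The point of the sketch is the contrast with token-level retention: eviction operates on whole frame blocks, so every retained frame keeps its full local evidence, and total memory stays bounded at `capacity × tokens_per_frame` regardless of stream length.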
Problem

Research questions and friction points this paper is trying to address.

streaming VGGT
KV-cache growth
bounded-memory
3D perception
geometric support
Innovation

Methods, ideas, or system contributions that make the work stand out.

FrameVGGT
streaming transformer
bounded memory
geometric coherence
KV-cache compression
Zhisong Xu
Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
Takeshi Oishi
Associate Professor, Institute of Industrial Science, The University of Tokyo
Computer Vision