FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unbounded growth of the KV cache in streaming visual geometry Transformers over long sequences, and the incoherent geometric support that results under fixed memory constraints. To this end, the authors propose a frame-driven rolling explicit-memory framework that treats each frame's incremental KV states as a coherent evidence block. These blocks are summarized through prototype-based compression and stored in a fixed-capacity mid-term memory bank, while a lightweight anchor mechanism mitigates long-term degradation. By reframing bounded-memory strategies from a geometric-support perspective, the approach replaces conventional token-level retention with frame-level evidence blocks. Experiments on long-sequence 3D reconstruction, video depth estimation, and camera pose estimation demonstrate significant improvements in the accuracy–memory trade-off, as well as enhanced long-term geometric consistency.

📝 Abstract
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy–memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.
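The abstract's core mechanism can be sketched as a small data structure: each frame's KV tokens enter as one block, each block is reduced to a compact prototype, and a fixed-capacity bank evicts the most redundant stored block when full. The sketch below is illustrative only (the class, the mean-pooled prototype, and the cosine-redundancy eviction rule are assumptions, not the paper's actual design, which uses learned summaries and an additional anchor tier).

```python
import numpy as np

class RollingFrameMemory:
    """Hypothetical sketch of a frame-level rolling KV memory.

    Each incoming frame contributes a block of key/value tokens, summarized
    into a single prototype vector. When the fixed-capacity bank is full, the
    stored block whose prototype is most similar to the newcomer is evicted,
    keeping the retained blocks complementary under a strict budget.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = []      # (keys, values) per retained frame
        self.prototypes = []  # one unit-norm summary vector per block

    @staticmethod
    def _prototype(keys: np.ndarray) -> np.ndarray:
        # Mean-pooled prototype (a stand-in for the paper's compression).
        p = keys.mean(axis=0)
        return p / (np.linalg.norm(p) + 1e-8)

    def insert(self, keys: np.ndarray, values: np.ndarray) -> None:
        proto = self._prototype(keys)
        if len(self.blocks) >= self.capacity:
            # Evict the stored block most redundant with the new one.
            sims = [float(proto @ q) for q in self.prototypes]
            evict = int(np.argmax(sims))
            self.blocks.pop(evict)
            self.prototypes.pop(evict)
        self.blocks.append((keys, values))
        self.prototypes.append(proto)

    def memory_tokens(self) -> int:
        return sum(k.shape[0] for k, _ in self.blocks)


# Usage: stream 10 frames of 64 KV tokens through a bank of capacity 4.
rng = np.random.default_rng(0)
bank = RollingFrameMemory(capacity=4)
for _ in range(10):
    k = rng.standard_normal((64, 32)).astype(np.float32)
    bank.insert(k, k.copy())
print(len(bank.blocks), bank.memory_tokens())  # → 4 256
```

The point of the sketch is the contrast with token-level retention: eviction operates on whole frame blocks, so every retained frame keeps its full local evidence, and total memory stays bounded at `capacity × tokens_per_frame` regardless of stream length.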
Problem

Research questions and friction points this paper is trying to address.

streaming VGGT
KV-cache growth
bounded-memory
3D perception
geometric support
Innovation

Methods, ideas, or system contributions that make the work stand out.

FrameVGGT
streaming transformer
bounded memory
geometric coherence
KV-cache compression
Zhisong Xu
Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
Takeshi Oishi
Associate Professor, Institute of Industrial Science, The University of Tokyo
Computer Vision