🤖 AI Summary
This work addresses the severe information loss that arises in dense 3D reconstruction from video streams under a fixed memory budget, which existing methods incur through binary token pruning and noisy single-layer scoring. The authors propose StreamCacheVGGT, a training-free streaming visual geometry Transformer framework that manages the cache through cross-layer consistency-enhanced importance scoring and a three-tier hybrid cache compression strategy, preserving critical geometric context while improving long-term stability. Cache updates are made robust by combining cross-layer tracking of token importance trajectories, order-statistical analysis, and nearest-neighbor merging on the key-vector manifold. On five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), the method reports new state-of-the-art results, with improved reconstruction accuracy and long-term stability under constant-cost constraints.
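To make the cross-layer scoring idea concrete, here is a minimal sketch of how a token's importance might be aggregated across layers so that a spike in a single layer does not dominate. The function name `cross_layer_scores`, the per-layer normalization, and the choice of a low quantile as the order statistic are all assumptions made for illustration; the summary does not specify which statistic the authors use.

```python
import numpy as np


def cross_layer_scores(layer_scores: np.ndarray, q: float = 0.25) -> np.ndarray:
    """Aggregate per-layer token importance into one score per token.

    layer_scores: array of shape (num_layers, num_tokens), e.g. attention
                  mass received by each cached token in each layer.
    q:            quantile used as the order statistic (hypothetical choice).
    """
    # Normalize each layer so scores are comparable across layers.
    totals = layer_scores.sum(axis=1, keepdims=True)
    normed = layer_scores / np.maximum(totals, 1e-12)
    # A low quantile over the layer axis rewards tokens that stay important
    # in most layers and discounts tokens that spike in only one layer.
    return np.quantile(normed, q, axis=0)


# Token 0 spikes in layer 0 only; token 1 is consistently important.
layer_scores = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.60, 0.30],
    [0.10, 0.60, 0.30],
])
scores = cross_layer_scores(layer_scores)
```

In this toy input, scoring on layer 0 alone would rank token 0 first, whereas the cross-layer aggregate ranks token 1 first.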
📝 Abstract
Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC goes beyond simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state of the art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
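The three-tier triage described for HCC can be sketched as follows: the highest-scoring tokens are retained as anchors, the middle tier is merged into its nearest anchor in key space, and the lowest tier is evicted. This is an illustrative sketch only, not the authors' implementation. The function name `hybrid_compress`, the tier sizes `n_keep` and `n_merge`, the use of cosine similarity for the nearest-neighbor assignment, and the unweighted averaging used to merge are all assumptions; the abstract does not state these details.

```python
import numpy as np


def hybrid_compress(keys, values, scores, n_keep, n_merge):
    """Compress a KV cache to n_keep entries using keep / merge / evict tiers.

    keys:   (N, d_k) cached key vectors
    values: (N, d_v) cached value vectors
    scores: (N,) importance score per token (higher = more important)
    """
    order = np.argsort(-scores)                # most important first
    keep = order[:n_keep]                      # tier 1: retained anchors
    merge = order[n_keep:n_keep + n_merge]     # tier 2: merged into anchors
    # Tier 3 (everything after the merge tier) is simply dropped.

    new_k = keys[keep].astype(float)
    new_v = values[keep].astype(float)
    counts = np.ones(len(keep))

    if len(merge) > 0:
        # Nearest anchor by cosine similarity between key vectors (assumed metric).
        def unit(x):
            return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

        nearest = (unit(keys[merge]) @ unit(keys[keep]).T).argmax(axis=1)
        # Accumulate merged tokens into their anchors; np.add.at handles
        # several tokens mapping to the same anchor.
        np.add.at(new_k, nearest, keys[merge])
        np.add.at(new_v, nearest, values[merge])
        np.add.at(counts, nearest, 1.0)

    # Unweighted mean of each anchor and the tokens merged into it.
    return new_k / counts[:, None], new_v / counts[:, None]


keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
values = np.array([[1.0], [5.0], [3.0], [100.0]])
scores = np.array([0.9, 0.8, 0.5, 0.1])
new_k, new_v = hybrid_compress(keys, values, scores, n_keep=2, n_merge=1)
```

In the toy call, token 2 is folded into anchor 0 (its nearest key) and token 3 is evicted, so the cache shrinks from four entries to two while the information in token 2 is partly retained.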