StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe information loss that occurs in dense 3D reconstruction from video streams under a fixed memory budget, which existing methods incur through binary token pruning and noisy single-layer scoring. The authors propose a training-free streaming visual geometry Transformer framework that reworks cache management with two components: cross-layer consistency-enhanced importance scoring and a three-tier hybrid cache compression strategy. Together they preserve critical geometric context and improve long-term stability. Robust cache updates combine cross-layer tracking of token importance trajectories, order-statistical analysis, and nearest-neighbor merging on the key-vector manifold. Evaluated on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), the method reports new state-of-the-art reconstruction accuracy and long-term stability while keeping memory and compute cost constant.
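
To make the scoring idea concrete, here is a minimal sketch of cross-layer aggregation with an order statistic. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_layer_scores`, the rank normalization, and the choice of a low quantile as the order statistic are all hypothetical, since the summary does not specify them.

```python
import numpy as np


def cross_layer_scores(layer_scores, quantile=0.25):
    """Aggregate per-layer token importance into one robust score per token.

    layer_scores: array of shape (num_layers, num_tokens). What each entry
        measures (e.g. attention received by a cached token) is an assumption.
    quantile: hypothetical knob selecting which order statistic to use.
    """
    layer_scores = np.asarray(layer_scores, dtype=float)
    num_tokens = layer_scores.shape[1]
    # Rank-normalize within each layer so that layers with different
    # activation scales become comparable (values in [0, 1]).
    ranks = layer_scores.argsort(axis=1).argsort(axis=1)
    trajectories = ranks / max(num_tokens - 1, 1)
    # A low quantile along the layer axis is an order statistic: a token
    # scores high only if it is important in most layers, so a spike in a
    # single layer is discounted.
    return np.quantile(trajectories, quantile, axis=0)


# Token 0 is important in every layer; token 1 spikes only in layer 0.
layer_scores = np.array([
    [0.90, 0.95, 0.10, 0.20],
    [0.80, 0.10, 0.20, 0.30],
    [0.85, 0.05, 0.30, 0.20],
])
scores = cross_layer_scores(layer_scores)
```

In this toy input, a plain mean over layers would rank token 1 second because of its single-layer spike, whereas the quantile-based score ranks it last.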

📝 Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC goes beyond simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state of the art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
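
The three-tier triage described above can be sketched as follows. This is a hedged illustration, not the paper's HCC module: the function name `hybrid_compress`, the parameters `n_keep` and `n_merge`, the use of cosine similarity for the nearest-neighbor assignment, and the uniform averaging of merged tokens are all assumptions made for the example.

```python
import numpy as np


def hybrid_compress(keys, values, scores, n_keep, n_merge):
    """Compress a KV cache to n_keep entries using three tiers.

    Tier 1: the n_keep highest-scoring tokens are retained as anchors.
    Tier 2: the next n_merge tokens are merged into their nearest anchor.
    Tier 3: all remaining tokens are evicted.
    keys, values: arrays of shape (num_tokens, dim); scores: (num_tokens,).
    """
    keys = np.asarray(keys, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep_idx = np.sort(order[:n_keep])  # keep anchors in original order
    merge_idx = order[n_keep:n_keep + n_merge]

    new_keys = keys[keep_idx].copy()
    new_values = values[keep_idx].copy()
    counts = np.ones(len(keep_idx))

    if len(merge_idx) and len(keep_idx):
        # Nearest anchor by cosine similarity between key vectors (assumed metric).
        eps = 1e-12
        anchors = keys[keep_idx]
        merged = keys[merge_idx]
        anchors_n = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + eps)
        merged_n = merged / (np.linalg.norm(merged, axis=1, keepdims=True) + eps)
        nearest = (merged_n @ anchors_n.T).argmax(axis=1)
        # Accumulate merged tokens into their anchors, then average uniformly.
        # np.add.at handles several tokens mapping to the same anchor.
        np.add.at(new_keys, nearest, merged)
        np.add.at(new_values, nearest, values[merge_idx])
        np.add.at(counts, nearest, 1)
        new_keys /= counts[:, None]
        new_values /= counts[:, None]

    return new_keys, new_values


keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [5.0, 5.0]])
values = np.array([[0.0], [10.0], [2.0], [20.0], [99.0]])
scores = np.array([0.9, 0.8, 0.5, 0.4, 0.1])
new_keys, new_values = hybrid_compress(keys, values, scores, n_keep=2, n_merge=2)
```

The output cache has exactly `n_keep` entries regardless of input length, which is what keeps memory constant. A pure-eviction baseline corresponds to `n_merge=0`, where the middle tier contributes nothing to the retained anchors.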

Problem

Research questions and friction points this paper is trying to address.

streaming 3D reconstruction

constant memory budget

token eviction

geometric information preservation

video stream

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Consistency-Enhanced Scoring

Hybrid Cache Compression

Streaming Visual Geometry Transformers