LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive computational cost and high memory footprint of 3D vision foundation models (e.g., VGGT) when processing long image sequences (hundreds to thousands of frames), this paper proposes a lightweight, geometry-aware acceleration method. The core insight is that local image tokens exhibit strong geometric correlations and cross-layer stability; leveraging this, the authors design a cache-efficient token-merging strategy, optimize anchor selection via geometric-importance analysis, and reuse merging indices across layers. The method supports FP8 quantization and seamless fine-tuning of the original model. Experiments demonstrate up to 10× inference speedup and substantial memory compression, while preserving the original model's accuracy on large-scale scene reconstruction tasks involving up to 1,000 images. The approach is both scalable, adapting to varying sequence lengths, and robust across diverse geometric configurations.

📝 Abstract
3D vision foundation models like the Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, limiting its application to large-scale scenes beyond a few hundred images. To address this, we propose LiteVGGT, which achieves up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
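The merging step described in the abstract can be illustrated with a toy example. Below is a minimal, hypothetical sketch in plain Python (not the authors' code): a generic importance score stands in for the paper's geometric importance, the top-scoring tokens become anchors, and every remaining token is folded into its most similar anchor.

```python
def merge_tokens(tokens, importance, keep_ratio=0.5):
    """Merge each low-importance token into its most similar anchor token.

    tokens:     list of feature vectors (lists of floats)
    importance: per-token score (a stand-in for geometric importance)
    Returns (merged_tokens, merge_map), where merge_map[i] is the anchor
    slot that token i was folded into (anchors map to themselves).
    """
    n = len(tokens)
    k = max(1, int(n * keep_ratio))
    order = sorted(range(n), key=lambda i: -importance[i])  # important first
    anchors, others = order[:k], order[k:]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv + 1e-12)

    merge_map = [0] * n
    groups = [[a] for a in anchors]          # each anchor starts its own group
    for slot, a in enumerate(anchors):
        merge_map[a] = slot
    for i in others:                         # fold each non-anchor token into
        slot = max(range(k), key=lambda s: cos(tokens[i], tokens[anchors[s]]))
        merge_map[i] = slot                  # its most similar anchor's group
        groups[slot].append(i)

    dim = len(tokens[0])
    merged = [[sum(tokens[i][d] for i in g) / len(g) for d in range(dim)]
              for g in groups]
    return merged, merge_map
```

The expensive part is the similarity pass that produces `merge_map`; caching and reusing that map across adjacent layers is what amortizes the cost in the paper's setting.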
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction with VGGT is prohibitively slow and memory-hungry on long image sequences
Highly similar tokens from local image regions create computational redundancy
Large-scale scenes beyond hundreds of images are out of reach for the vanilla model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-aware cached token merging for efficiency
Optimized anchor token selection preserving key information
Cached reusable merge indices across network layers
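The cross-layer caching idea in the list above can be sketched as follows. This is an illustrative toy with hypothetical helper names, not the authors' implementation: the similarity pass that produces merge indices runs only every few layers, the cached indices are reused in between (justified by the paper's observation that token similarity is stable across adjacent layers), and merged features are scattered back so the token count stays constant.

```python
def compute_merge_map(tokens, keep=2):
    """Assign every token to its nearest anchor. The paper selects anchors
    by geometric importance; here the first `keep` tokens stand in."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) *
                      (sum(b * b for b in v) ** 0.5) + 1e-12)
    return [max(range(keep), key=lambda s: cos(t, tokens[s])) for t in tokens]

def apply_merge(tokens, merge_map):
    """Average all tokens sharing an anchor slot: N tokens -> K tokens."""
    k, dim = max(merge_map) + 1, len(tokens[0])
    out, cnt = [[0.0] * dim for _ in range(k)], [0] * k
    for t, s in zip(tokens, merge_map):
        cnt[s] += 1
        for d in range(dim):
            out[s][d] += t[d]
    return [[v / c for v in row] for row, c in zip(out, cnt)]

def unmerge(merged, merge_map):
    """Scatter merged features back to the original token positions."""
    return [list(merged[s]) for s in merge_map]

def forward(tokens, layers, recompute_every=4):
    """Recompute merge indices only every `recompute_every` layers and
    reuse the cached indices in between."""
    cached = None
    for depth, layer in enumerate(layers):
        if depth % recompute_every == 0:
            cached = compute_merge_map(tokens)   # expensive similarity pass
        small = apply_merge(tokens, cached)      # layer runs on fewer tokens
        tokens = unmerge([layer(t) for t in small], cached)
    return tokens
```

Only the occasional `compute_merge_map` call pays for pairwise similarity; the per-layer `apply_merge`/`unmerge` steps are cheap gathers and averages, which is where the latency savings come from.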
👥 Authors
Zhijian Shu (Nanjing University of Posts and Telecommunications)
Cheng Lin (Macau University of Science and Technology)
Tao Xie (Horizon Robotics)
Wei Yin (Staff Research Scientist, Horizon Robotics; World Model, Generative AI, Physical AI)
Ben Li (China Mobile Zijin Innovation Institute)
Zhiyuan Pu (China Mobile Zijin Innovation Institute)
Weize Li (TARS Robotics)
Yao Yao (Nanjing University)
Xun Cao (Nanjing University; Computational Photography, Computational Imaging, Image & Video Processing)
Xiaoyang Guo (Florida State University; Statistical Shape Analysis, Graph, Computer Vision, Machine Learning)
Xiao-Xiao Long (Associate Professor at Nanjing University; AnySyn3D; 3D Vision, Generative AI, Spatial Intelligence, Embodied AI)