Quantized Visual Geometry Grounded Transformer

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive computational and memory overhead hindering deployment of Transformer-based 3D reconstruction models (e.g., VGGT), this paper introduces QuantVGGT—the first post-training quantization framework tailored for billion-parameter visual-geometry grounded Transformers. We tackle two key challenges: heavy-tailed activation distributions induced by data-agnostic special tokens, and instability in multi-view calibration sample selection. To this end, we propose a dual-smoothing fine-grained quantization mechanism (global Hadamard rotation combined with channel-wise smoothing) and a noise-filtered diverse sampling strategy (deep-layer statistical denoising with frame-aware clustering). At 4-bit precision, QuantVGGT achieves 3.7× memory compression and 2.5× measured speedup over full-precision inference, while preserving over 98% of the original reconstruction accuracy. Our method establishes new state-of-the-art performance across multiple 3D reconstruction benchmarks.

📝 Abstract
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It rests on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
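The dual-smoothing idea from the abstract can be illustrated with a minimal sketch: a global Hadamard rotation spreads heavy-tailed outliers evenly across channels, a per-channel smoothing scale then equalizes inter-channel variance, and finally group-wise (fine-grained) uniform quantization is applied. This is an assumption-laden toy illustration, not the authors' implementation; all function names, the smoothing exponent `alpha`, and the group size are hypothetical.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_smooth_quantize(x, bits=4, group=32, alpha=0.5):
    """Toy sketch of dual-smoothed fine-grained quantization:
    rotation -> channel smoothing -> group-wise quantization -> inverse ops.
    Hyperparameters and structure are illustrative assumptions."""
    n = x.shape[-1]
    H = hadamard(n)
    x_rot = x @ H                                  # pre-global Hadamard rotation
    scale = np.abs(x_rot).max(axis=0) ** alpha     # post-local channel smoothing scale
    scale = np.where(scale == 0, 1.0, scale)
    x_s = x_rot / scale
    qmax = 2 ** (bits - 1) - 1                     # symmetric 4-bit range: [-8, 7]
    xg = x_s.reshape(x_s.shape[0], -1, group)      # fine-grained: one step per group
    s = np.abs(xg).max(axis=-1, keepdims=True) / qmax
    s = np.where(s == 0, 1.0, s)
    q = np.clip(np.round(xg / s), -qmax - 1, qmax)
    # Dequantize, then undo smoothing and rotation (H is orthonormal, so H @ H.T = I).
    return (q * s).reshape(x_s.shape) * scale @ H.T

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128))
x[:, 0] *= 50.0                                    # mimic a heavy-tailed "special token" channel
err = np.abs(dual_smooth_quantize(x) - x).mean()
```

The rotation is what makes 4-bit group quantization tolerable here: without it, the single outlier channel would dominate its group's quantization step and wash out all other values.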
Problem

Research questions and friction points this paper is trying to address.

Compressing billion-scale Visual Geometry Grounded Transformers for efficient deployment
Addressing heavy-tailed activation distributions from data-independent special tokens
Stabilizing quantization ranges for multi-view 3D data calibration samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Smoothed Fine-Grained Quantization for heavy-tailed distributions
Noise-Filtered Diverse Sampling for stable calibration ranges
4-bit QuantVGGT delivers 3.7× memory reduction and 2.5× real-hardware speedup while keeping over 98% of full-precision accuracy
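The noise-filtered diverse sampling contribution above can be sketched as a two-stage selection: discard frames whose deep-feature statistics are outliers, then cluster the survivors and keep the frame nearest each centroid so the calibration set stays both clean and diverse. The feature-norm statistic, percentile threshold, and plain k-means below are hypothetical stand-ins for the paper's actual criteria.

```python
import numpy as np

def select_calibration_frames(feats, k=8, outlier_pct=90, iters=20):
    """Toy sketch of noise-filtered diverse sampling over per-frame
    deep features (shape: num_frames x dim). All thresholds assumed."""
    norms = np.linalg.norm(feats, axis=1)
    keep = norms <= np.percentile(norms, outlier_pct)  # deep-layer statistical denoising
    idx = np.flatnonzero(keep)
    f = feats[idx]
    rng = np.random.default_rng(0)
    centers = f[rng.choice(len(f), size=k, replace=False)]
    for _ in range(iters):                             # plain k-means for frame-aware clusters
        d = np.linalg.norm(f[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = f[assign == c].mean(axis=0)
    d = np.linalg.norm(f[:, None, :] - centers[None, :, :], axis=-1)
    return idx[np.unique(d.argmin(axis=0))]            # one representative frame per cluster

feats = np.random.default_rng(1).standard_normal((200, 16))
sel = select_calibration_frames(feats)
```

Picking one representative per cluster keeps the calibration batch small while still covering diverse viewpoints, which is what stabilizes the quantization ranges across multi-view inputs.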