DriveVGGT: Visual Geometry Transformer for Autonomous Driving

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VGGT models fail to generalize directly to autonomous driving due to their neglect of domain-specific priors—namely, minimal inter-frame viewpoint overlap, known camera intrinsics/extrinsics enabling absolute scale estimation, and rigid multi-camera geometry. To address this, we propose DriveVGGT, the first scale-aware 4D reconstruction framework tailored for autonomous driving. It explicitly integrates spatiotemporal continuity and rigid-body geometric constraints into a visual-geometric Transformer via three novel mechanisms: temporal video attention, multi-camera consistency attention, and normalized relative pose embeddings. Additionally, DriveVGGT introduces two dedicated heads—an absolute scale regression head and an ego-vehicle pose head—enabling end-to-end, scale-consistent 4D scene reconstruction. Extensive experiments on mainstream autonomous driving benchmarks demonstrate significant improvements over VGGT, StreamVGGT, and fastVGGT. Ablation studies validate the distinct and complementary contributions of each component.

📝 Abstract
Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors between the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) The relative positions of all cameras remain fixed even though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and FastVGGT on autonomous driving datasets, while extensive ablation studies verify the effectiveness of the proposed designs.
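Prior (ii) above is the key to absolute scale: a feed-forward reconstruction is typically only defined up to a global scale, but the calibrated rig fixes the true metric distance between any two rigidly mounted cameras. A minimal sketch of this idea (an illustration of the prior, not the paper's actual scale head — the function name and inputs are assumptions):

```python
import numpy as np

def absolute_scale(pred_cam_a, pred_cam_b, known_baseline_m):
    """Scale factor mapping predicted (up-to-scale) units to meters.

    pred_cam_a / pred_cam_b: predicted 3D camera centers in the
    reconstruction's arbitrary units; known_baseline_m: true metric
    distance between the two cameras, taken from the calibrated rig
    extrinsics (known in AD systems).
    """
    pred_baseline = np.linalg.norm(np.asarray(pred_cam_b) - np.asarray(pred_cam_a))
    return known_baseline_m / pred_baseline

# Example: predicted centers 0.5 units apart, real rig baseline 1.2 m;
# multiplying all predicted geometry by the returned factor yields
# metrically scaled output.
s = absolute_scale([0.0, 0.0, 0.0], [0.5, 0.0, 0.0], 1.2)
print(s)  # 2.4
```

DriveVGGT regresses absolute scale with a dedicated head rather than computing it post hoc, but the known extrinsics are what make the quantity well defined in the first place.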
Problem

Research questions and friction points this paper is trying to address.

Adapts VGGT for autonomous driving with multi-camera constraints
Integrates known camera parameters to estimate absolute scale accurately
Ensures cross-camera consistency using normalized pose embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Video Attention module processes multi-camera videos independently
Multi-camera Consistency Attention module uses normalized pose embeddings for cross-camera consistency
Extended VGGT heads include absolute scale and ego vehicle pose estimation
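The MCA mechanism described above combines two ingredients: a window mask that restricts each frame token to nearby frames, and a bias derived from normalized relative camera poses added to the attention. A minimal single-head sketch under those assumptions (the tensor shapes, the scalar pose bias, and the function name are illustrative, not the authors' implementation):

```python
import numpy as np

def window_attention(tokens, rel_pose_bias, window=1):
    """tokens: (T, D), one token per frame; rel_pose_bias: (T, T) scalar
    bias assumed to be derived from normalized relative camera poses.
    Each token attends only to frames within +/- `window` timesteps.
    """
    T, D = tokens.shape
    logits = tokens @ tokens.T / np.sqrt(D) + rel_pose_bias
    # Mask out frames outside the temporal window.
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    logits = np.where(mask, logits, -np.inf)
    # Softmax over the allowed frames only (diagonal is always allowed).
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
out = window_attention(rng.normal(size=(5, 8)), np.zeros((5, 5)), window=1)
print(out.shape)  # (5, 8)
```

The windowing keeps cost linear in sequence length while the pose-derived bias injects the rig's rigid geometry into cross-camera attention.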