Visual Implicit Geometry Transformer for Autonomous Driving

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of estimating continuous 3D occupancy fields from surround-view images in autonomous driving by proposing ViGT, a calibration-free Transformer architecture that directly predicts bird’s-eye-view (BEV) 3D occupancy from multi-view images. ViGT leverages implicit geometric modeling and multi-view geometry fusion to establish a unified geometric representation in BEV space. It employs a self-supervised strategy that jointly uses image and LiDAR signals without requiring manual annotations, enabling mixed training across datasets and sensor configurations. Trained on a mixture of five major autonomous driving datasets, the method achieves state-of-the-art pointmap estimation with the best average rank across evaluated baselines, and matches the performance of supervised approaches on the Occ3D-nuScenes benchmark.

📝 Abstract
We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundation models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV) representation, addressing domain-specific requirements. ViGT fuses geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (nuScenes, Waymo, nuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to supervised methods. The source code is publicly available at \href{https://github.com/whesense/ViGT}{https://github.com/whesense/ViGT}.
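The abstract's central idea is a continuous occupancy field that can be queried at arbitrary metric coordinates in a single BEV frame, in contrast to a fixed discrete voxel grid. As a purely illustrative sketch (not the authors' code; function and parameter names are hypothetical), trilinear interpolation shows how continuous metric queries can be answered from a discretized BEV occupancy volume:

```python
import numpy as np

def query_occupancy(grid, points, extent):
    """Trilinearly interpolate a discrete occupancy grid at continuous
    3D query points, mimicking a continuous occupancy field.

    grid   : (X, Y, Z) array of occupancy probabilities in [0, 1]
    points : (N, 3) metric coordinates in the ego/BEV frame
    extent : (xmin, xmax, ymin, ymax, zmin, zmax) metric bounds of the grid
    """
    shape = np.array(grid.shape, dtype=np.float64)
    mins = np.array(extent[0::2], dtype=np.float64)
    maxs = np.array(extent[1::2], dtype=np.float64)
    # Map metric coordinates to fractional voxel indices.
    idx = (points - mins) / (maxs - mins) * (shape - 1)
    idx = np.clip(idx, 0.0, shape - 1 - 1e-9)
    lo = np.floor(idx).astype(int)
    frac = idx - lo
    # Accumulate the 8 corner contributions of each query's voxel cell.
    out = np.zeros(len(points))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, frac[:, 0], 1 - frac[:, 0])
                     * np.where(dy, frac[:, 1], 1 - frac[:, 1])
                     * np.where(dz, frac[:, 2], 1 - frac[:, 2]))
                c = np.minimum(lo + [dx, dy, dz], grid.shape - np.array([1, 1, 1]))
                out += w * grid[c[:, 0], c[:, 1], c[:, 2]]
    return out
```

In the actual model the field would be decoded from learned BEV features rather than a stored grid, but the interface is the same: metric points in, occupancy probabilities out.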
Problem

Research questions and friction points this paper is trying to address.

3D occupancy
autonomous driving
geometric representation
multi-view geometry
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

calibration-free architecture
continuous 3D occupancy field
self-supervised learning
bird's-eye-view (BEV) representation
geometric foundation model
Arsenii Shirokov — Lomonosov Moscow State University
Mikhail Kuznetsov — AWS (Machine Learning)
Danila Stepochkin — Lomonosov Moscow State University
Egor Evdokimov — Lomonosov Moscow State University
Daniil Glazkov — Lomonosov Moscow State University
Nikolay Patakin — Lomonosov Moscow State University
Anton Konushin — Lomonosov Moscow State University (computer vision, deep learning, computer graphics)
Dmitry Senushkin — Unknown affiliation (computer vision, computer graphics, deep learning, math)