VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the lack of high-quality 3D geometric and dynamic priors in vision-centric autonomous driving algorithms, this paper proposes VisionPAD—the first pretraining framework integrating 3D Gaussian Splatting with self-supervised voxel-wise velocity estimation for single- and multi-view image understanding. Without requiring depth or motion annotations, VisionPAD jointly optimizes differentiable projection, voxel-based motion deformation, self-supervised rendering losses, and multi-frame photometric consistency to enable co-learning of geometry and dynamics. Evaluated on nuScenes and SemanticKITTI benchmarks, VisionPAD consistently outperforms existing pretraining methods across three downstream tasks: 3D object detection, occupancy prediction, and semantic map segmentation. These results demonstrate the effectiveness and generalizability of purely image-driven spatiotemporal 3D representation learning.

📝 Abstract
This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
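The multi-frame photometric consistency step described in the abstract is the standard inverse-warping construction from self-supervised depth estimation: back-project current-frame pixels using the rendered depth, move them into an adjacent camera with the known relative pose, and sample that frame's image. A minimal NumPy sketch, assuming pinhole intrinsics `K`, a 4x4 relative pose `T_cur_to_adj`, and nearest-neighbour sampling (the paper's exact loss and sampling scheme may differ):

```python
import numpy as np

def warp_to_current(adj_img, depth, K, T_cur_to_adj):
    """Inverse-warp an adjacent frame into the current view.

    Hypothetical helper: back-project current-frame pixels with the
    rendered depth, transform them by the relative pose, and sample
    the adjacent image at the projected locations.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    # Back-project to 3D points in the current camera frame.
    cam = np.linalg.inv(K) @ pix * depth.ravel()
    # Move the points into the adjacent camera frame.
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    adj = (T_cur_to_adj @ cam_h)[:3]
    # Project back to pixels and sample (nearest neighbour for brevity).
    proj = K @ adj
    u = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, W - 1)
    v = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, H - 1)
    return adj_img[v, u].reshape(H, W, -1)

def photometric_loss(cur_img, warped):
    # L1 photometric error on the warped image; the paper combines
    # this with its rendering losses as pure image supervision.
    return np.abs(cur_img - warped).mean()
```

Because the supervision signal is only the reprojected image, gradients flow into the rendered depth, which is how the geometry improves without depth labels.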
Problem

Research questions and friction points this paper is trying to address.

Self-supervised pre-training for autonomous driving vision algorithms
Efficient 3D reconstruction using multi-view image supervision
Enhancing 3D perception via voxel velocity and photometric consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting for multi-view reconstruction
Self-supervised voxel velocity estimation from sequential data
Multi-frame photometric consistency enhances geometric perception
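The voxel velocity idea in the bullets above can be sketched as a grid deformation: each voxel's feature is displaced by its estimated velocity before rendering the adjacent timestep, so supervising the rendered output trains the velocity head with no motion labels. A toy NumPy version with nearest-cell scattering (the function name and the discrete scatter are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def warp_voxels(feat, vel, dt=1.0):
    """Deform a voxel feature grid by per-voxel velocity.

    Hypothetical sketch: move each voxel's feature to the cell its
    estimated velocity predicts at time t + dt. Rendering the warped
    grid and comparing against the adjacent frame's images provides
    the self-supervised signal for velocity.
    """
    X, Y, Z, C = feat.shape
    warped = np.zeros_like(feat)
    idx = np.stack(np.mgrid[0:X, 0:Y, 0:Z], axis=-1)     # (X, Y, Z, 3)
    tgt = np.round(idx + vel * dt).astype(int)           # displaced cells
    inb = ((tgt >= 0) & (tgt < [X, Y, Z])).all(axis=-1)  # drop out-of-grid
    s, t = idx[inb], tgt[inb]
    warped[t[:, 0], t[:, 1], t[:, 2]] = feat[s[:, 0], s[:, 1], s[:, 2]]
    return warped
```

A differentiable version would use trilinear splatting instead of rounding, but the data flow is the same: velocity only enters the loss through the warped, rendered grid.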