GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of jointly modeling environmental geometry, semantic structure, and their temporal evolution in autonomous driving, this paper introduces a unified geometric-semantic self-supervised pre-training paradigm based on 4D occupancy fields. Methodologically, it learns structured, generalizable dynamic scene representations through three targets queried at arbitrary 4D (space-time) points: general occupancy prediction, ego-occupancy modeling, and distillation of high-level semantic features from vision foundation models (VFMs). Key contributions include: (1) the first self-supervised 4D occupancy field formulation integrating geometry and semantics; and (2) the first integration of VFM-based semantic distillation into spatiotemporal occupancy prediction, enabling unified multi-task representation learning. The approach achieves significant gains on semantic occupancy forecasting, online map construction, and ego-vehicle trajectory prediction, demonstrating strong generalization and practical deployment potential.

📝 Abstract
Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.
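The abstract describes three prediction targets decoded at any queried 4D point. The sketch below is purely illustrative and not the paper's implementation: the decoder architecture, feature dimensions, and weight shapes are all assumptions. It shows the general shape of a query-based field head with a shared trunk and three output branches (general occupancy, ego occupancy, and a feature vector to match a frozen VFM).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): scene feature size,
# hidden width, and the VFM feature dimension to distill.
FEAT_DIM, HIDDEN, VFM_DIM = 32, 64, 16

# Randomly initialized weights standing in for a trained decoder.
W1 = rng.normal(scale=0.1, size=(FEAT_DIM + 4, HIDDEN))
W_occ = rng.normal(scale=0.1, size=(HIDDEN, 1))
W_ego = rng.normal(scale=0.1, size=(HIDDEN, 1))
W_sem = rng.normal(scale=0.1, size=(HIDDEN, VFM_DIM))

def query_field(scene_feat, xyzt):
    """Decode one 4D query point (x, y, z, t) against a scene feature.

    Returns the three GASP-style prediction targets: general occupancy
    probability, ego-occupancy probability, and a semantic feature
    vector intended to match a vision foundation model's output.
    """
    x = np.concatenate([scene_feat, xyzt])
    h = np.maximum(0.0, x @ W1)                 # shared trunk (ReLU MLP)
    occ = 1.0 / (1.0 + np.exp(-(h @ W_occ)))    # occupancy in [0, 1]
    ego = 1.0 / (1.0 + np.exp(-(h @ W_ego)))    # ego-path occupancy
    sem = h @ W_sem                             # feature to distill from VFM
    return occ.item(), ego.item(), sem

scene_feat = rng.normal(size=FEAT_DIM)          # placeholder scene encoding
occ, ego, sem = query_field(scene_feat, np.array([1.0, 0.5, 0.0, 0.8]))
```

Because the field is queried at continuous coordinates rather than on a fixed voxel grid, supervision can be applied at arbitrary points in space and time, which is what makes the "continuous 4D occupancy" framing scalable.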
Problem

Research questions and friction points this paper is trying to address.

Unifies geometric and semantic self-supervised pre-training for autonomous driving.
Predicts 4D occupancy fields to model environment evolution over time.
Improves semantic occupancy forecasting, mapping, and ego trajectory prediction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified geometric and semantic self-supervised pre-training
Predicts 4D occupancy fields for environment modeling
Integrates vision foundation model for high-level features
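The last bullet, distilling high-level features from a vision foundation model, is typically trained by regressing the predicted feature toward the frozen VFM feature. A cosine-alignment objective is one common choice; this is an assumption for illustration, as the paper may use L2 or another matching loss.

```python
import numpy as np

def distillation_loss(pred_feat, vfm_feat):
    """Cosine-style feature distillation loss (illustrative assumption):
    0 when the predicted feature is perfectly aligned with the frozen
    vision-foundation-model feature, up to 2 when anti-aligned."""
    p = pred_feat / np.linalg.norm(pred_feat)
    t = vfm_feat / np.linalg.norm(vfm_feat)
    return 1.0 - float(p @ t)
```

Minimizing this alongside the two occupancy objectives is what ties semantics into the otherwise purely geometric pre-training signal.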