RefAV: Towards Planning-Centric Scenario Mining

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In autonomous driving safety verification, identifying safety-critical scenarios in large, uncurated multi-modal driving logs remains difficult: traditional scenario mining relies on hand-crafted structured queries and is both error-prone and time-consuming. To address this, we revisit spatio-temporal scenario mining as a natural-language-driven task centered on motion planning. Our contributions are threefold: (1) We introduce RefAV, a benchmark dataset of 10,000 planning-oriented natural language queries derived from 1,000 driving logs in the Argoverse 2 Sensor dataset; (2) We evaluate vision-language models (VLMs) and referential multi-object trackers on detecting whether a described scenario occurs in an HD-map-localized log and, if so, localizing it in time and space; (3) We show empirically that naively repurposing off-the-shelf VLMs performs poorly on this task and provide an analysis of our baselines. Code and data are publicly available.

📝 Abstract
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
Problem

Research questions and friction points this paper is trying to address.

Identifying safety-critical scenarios from uncurated AV driving logs
Improving error-prone traditional scenario mining techniques
Localizing described scenarios in time and space using VLMs
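
As an illustration of what "detect whether a scenario occurs and localize it in time and space" could mean in practice, the following is a minimal sketch, not the RefAV data format or evaluation code. It assumes a hypothetical output schema in which a method returns the set of (track, timestamp) pairs that match a query; the names `Detection`, `occurs`, `temporal_segments`, and `tracks_at`, and the example query in the comment, are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One tracked object at one timestamp (hypothetical schema, not RefAV's)."""
    track_id: str
    timestamp_ns: int


def occurs(matches: set[Detection]) -> bool:
    """Log-level answer: does the described scenario occur at all?"""
    return len(matches) > 0


def temporal_segments(matches: set[Detection], max_gap_ns: int) -> list[tuple[int, int]]:
    """Temporal localization: merge matched timestamps into (start, end) intervals.

    Timestamps no more than max_gap_ns apart are placed in the same interval.
    """
    stamps = sorted({d.timestamp_ns for d in matches})
    segments: list[tuple[int, int]] = []
    for t in stamps:
        if segments and t - segments[-1][1] <= max_gap_ns:
            segments[-1] = (segments[-1][0], t)  # extend the current interval
        else:
            segments.append((t, t))  # start a new interval
    return segments


def tracks_at(matches: set[Detection], timestamp_ns: int) -> set[str]:
    """Spatial localization proxy: which tracks are referred to at a given time."""
    return {d.track_id for d in matches if d.timestamp_ns == timestamp_ns}


# Made-up matches for an invented query such as "pedestrian crossing in front of a vehicle".
matches = {
    Detection("ped_1", 0),
    Detection("ped_1", 100),
    Detection("veh_7", 100),
    Detection("ped_1", 500),
}
segments = temporal_segments(matches, max_gap_ns=100)
```

With these made-up matches, `occurs(matches)` is true, the matched timestamps group into two intervals, and `tracks_at(matches, 100)` names both the pedestrian and the vehicle. In a real system, the spatial part would come from the tracks' boxes in the HD-map frame rather than from track identifiers alone.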
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models for scenario mining
Introduces RefAV dataset with natural language queries
Evaluates several referential multi-object trackers as baselines for spatio-temporal localization