🤖 AI Summary
This work addresses a data gap in autonomous driving: in-the-wild dashcam video is abundant and diverse, but it is unstructured and monocular, whereas autonomous driving systems expect structured, multimodal sensor inputs. The authors propose Sensor2Sensor, a generative modeling paradigm that converts monocular dashcam footage into multimodal autonomous vehicle logs comprising multi-view camera images and LiDAR point clouds. Because no naturally paired dashcam and AV-log data exist, the method builds pseudo-paired training data by reconstructing real AV logs with 4D Gaussian Splatting and rendering dashcam-style novel views, then uses a diffusion model for the cross-modal generation. The authors quantitatively evaluate the fidelity and realism of the generated sensor data and demonstrate the approach on challenging internet and dashcam footage, converting long-tail driving scenarios into standardized multimodal formats suitable for training and validation.
📝 Abstract
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, is limited in scale, in the diversity of sensor configurations, and in geographic and long-tail behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS, which expect structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering, which yields pseudo-pairs of dashcam-style input and real AV-log output. Sensor2Sensor then uses a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations of the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, unlocking vast external data sources for AV development.
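The pseudo-pairing idea in the abstract can be illustrated with a minimal data-flow sketch. It is not the authors' implementation: the abstract does not specify data layouts or interfaces, so `AVLogFrame`, `PseudoPair`, `placeholder_dashcam_render`, and `build_pseudo_pairs` are all hypothetical names, and the renderer is a trivial stand-in for 4DGS reconstruction and novel-view rendering. The sketch only shows how each real AV-log frame could serve as the training target for a dashcam-style input derived from it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical containers for illustration; the real data layout is not
# described in the abstract.
Image = List[List[float]]            # H x W grid of pixel intensities
Point = Tuple[float, float, float]   # one LiDAR return (x, y, z)


@dataclass
class AVLogFrame:
    """One timestep of a real AV log: several camera views plus LiDAR."""
    multi_view_images: List[Image]
    lidar_points: List[Point]


@dataclass
class PseudoPair:
    """A training example: dashcam-style input, real AV-log frame as target."""
    dashcam_image: Image
    target: AVLogFrame


def placeholder_dashcam_render(frame: AVLogFrame) -> Image:
    """Stand-in for 4DGS reconstruction + novel-view rendering (assumption).

    It only copies the first camera image. A real pipeline would render the
    reconstructed scene from a virtual dashcam viewpoint instead.
    """
    return [row[:] for row in frame.multi_view_images[0]]


def build_pseudo_pairs(
    log: List[AVLogFrame],
    render: Callable[[AVLogFrame], Image] = placeholder_dashcam_render,
) -> List[PseudoPair]:
    """Pair every AV-log frame with a dashcam-style view derived from it."""
    return [PseudoPair(dashcam_image=render(f), target=f) for f in log]


if __name__ == "__main__":
    frame = AVLogFrame(
        multi_view_images=[[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]],
        lidar_points=[(1.0, 2.0, 3.0)],
    )
    pairs = build_pseudo_pairs([frame])
    print(len(pairs), len(pairs[0].target.multi_view_images))
```

In this framing, a generative model would be trained to map `dashcam_image` to `target`, and at inference time would receive real dashcam frames in place of the rendered ones.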