123D: Unifying Multi-Modal Autonomous Driving Data at Scale

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of unifying diverse autonomous driving datasets, which differ significantly in modality, format, synchronization scheme, and annotation convention. The authors propose 123D, an open-source framework that exposes such data through a single API. Each modality (for example camera images, LiDAR point clouds, ego states, and HD maps) is stored as an independent timestamped event stream with no prescribed sampling rate, which allows both synchronous and asynchronous access across arbitrary datasets. The framework consolidates eight real-world driving datasets, spanning 3,300 hours and 90,000 kilometers, together with one synthetic dataset. It also includes a systematic study that compares annotation statistics and assesses the pose and calibration accuracy of each dataset. Two applications illustrate what the framework enables: cross-dataset transfer in 3D object detection and reinforcement learning for planning.

📝 Abstract

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
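To make the storage idea in the abstract concrete, the following is a minimal Python sketch of modalities kept as independent timestamped event streams, with one synchronous and one asynchronous access path. It is an illustration under stated assumptions, not the py123d API: the names `EventStream`, `synchronized`, and `asynchronous`, the microsecond timestamps, and the nearest-timestamp matching rule are all hypothetical choices made for this example.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from heapq import merge
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass
class EventStream:
    """Hypothetical container: one modality as time-sorted (timestamp_us, payload) events."""
    name: str
    timestamps: List[int] = field(default_factory=list)
    payloads: List[Any] = field(default_factory=list)

    def append(self, timestamp_us: int, payload: Any) -> None:
        # No sampling rate is assumed; only non-decreasing time order is enforced.
        if self.timestamps and timestamp_us < self.timestamps[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps.append(timestamp_us)
        self.payloads.append(payload)

    def nearest(self, query_us: int,
                tolerance_us: Optional[int] = None) -> Optional[Tuple[int, Any]]:
        """Return the event closest in time to query_us, or None if none is close enough."""
        if not self.timestamps:
            return None
        i = bisect_left(self.timestamps, query_us)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.timestamps)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - query_us))
        if tolerance_us is not None and abs(self.timestamps[best] - query_us) > tolerance_us:
            return None
        return self.timestamps[best], self.payloads[best]

    def events(self) -> Iterator[Tuple[int, str, Any]]:
        for t, p in zip(self.timestamps, self.payloads):
            yield t, self.name, p


def synchronized(reference: EventStream, others: List[EventStream],
                 tolerance_us: int) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Synchronous access: one frame per reference event, nearest match from each other stream."""
    for t, payload in zip(reference.timestamps, reference.payloads):
        frame: Dict[str, Any] = {reference.name: (t, payload)}
        for stream in others:
            frame[stream.name] = stream.nearest(t, tolerance_us)  # None if no match in tolerance
        yield t, frame


def asynchronous(streams: List[EventStream]) -> Iterator[Tuple[int, str, Any]]:
    """Asynchronous access: every event from every stream, in global time order."""
    return merge(*(s.events() for s in streams), key=lambda e: e[0])
```

In this sketch, a camera stream and a LiDAR stream recorded at different rates can be read either as camera-aligned frames via `synchronized(camera, [lidar], tolerance_us=10_000)` or as a single time-ordered sequence via `asynchronous([camera, lidar])`. Because no rate is baked into the storage, the choice of reference stream and tolerance is deferred to read time.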

Problem

Research questions and friction points this paper is trying to address.

autonomous driving

multi-modal data

data unification

annotation inconsistency

sensor synchronization

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal data unification

timestamped event stream

cross-dataset generalization