AI Summary
To address the reliance of end-to-end autonomous driving on large-scale manually annotated control labels and external pre-trained models, this paper proposes SSIL, which it presents as the first fully self-supervised framework for the task. SSIL needs neither ground-truth steering angle annotations nor off-the-shelf pretraining and uses only onboard camera and LiDAR data: it generates pseudo steering labels from LiDAR-based ego-pose estimates, which enables self-supervised imitation learning. The method combines multimodal feature fusion, an instruction-conditioned network architecture, and self-supervised regression learning (SSRL). Evaluated on three benchmark datasets, SSIL achieves control accuracy comparable to that of fully supervised methods, and its pseudo-label generator outperforms a PID-based baseline. The core contribution is an end-to-end self-supervised driving paradigm that requires neither control labels nor pretraining.
Abstract
In autonomous driving, the end-to-end (E2E) driving approach, which predicts vehicle control signals directly from sensor data, is rapidly gaining attention. Learning a safe E2E driving system requires an extensive amount of driving data and human intervention. Vehicle control data is built from many hours of human driving, so constructing large vehicle control datasets is challenging. Publicly available driving datasets often cover only a limited range of driving scenes, and vehicle control data can typically be collected only by vehicle manufacturers. To address these challenges, this letter proposes the first fully self-supervised learning framework for E2E driving, self-supervised imitation learning (SSIL), based on the self-supervised regression learning (SSRL) framework. The proposed SSIL framework can learn E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging (LiDAR) sensors. In addition, we propose two E2E driving networks that predict driving commands depending on a high-level instruction. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy very comparable to that of its supervised learning counterpart. The proposed pseudo-label predictor outperformed an existing one that uses a proportional-integral-derivative (PID) controller.
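To make the pseudo-label idea concrete, the sketch below shows one way a steering-angle target could be derived from two consecutive ego poses. It is an illustration only, not the predictor used in the paper: it assumes planar poses `(x, y, yaw)` with yaw in radians measured counterclockwise, a kinematic bicycle model, and a hypothetical `wheelbase` of 2.7 m. The function names and the `min_dist` threshold are likewise assumptions introduced here.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def pseudo_steering_angle(prev_pose, curr_pose, wheelbase=2.7, min_dist=1e-3):
    """Estimate a pseudo steering angle (radians) from two planar poses.

    Assumption (not from the paper): a kinematic bicycle model, where
    heading change per unit distance equals tan(steering) / wheelbase.
    Poses are (x, y, yaw) tuples; wheelbase is a hypothetical value in meters.
    """
    x0, y0, yaw0 = prev_pose
    x1, y1, yaw1 = curr_pose

    # Distance travelled between the two time points.
    dist = math.hypot(x1 - x0, y1 - y0)
    if dist < min_dist:
        # Vehicle is (almost) stationary: the curvature is undefined,
        # so fall back to a neutral steering target.
        return 0.0

    # Heading change, wrapped so that crossing +/-pi does not cause a jump.
    dyaw = wrap_angle(yaw1 - yaw0)

    # Bicycle model: dyaw / dist = tan(delta) / wheelbase.
    return math.atan(wheelbase * dyaw / dist)


if __name__ == "__main__":
    straight = pseudo_steering_angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
    left_turn = pseudo_steering_angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.1))
    print(f"straight: {straight:.4f} rad, left turn: {left_turn:.4f} rad")
```

Under these assumptions, a positive output corresponds to a counterclockwise (left) heading change. In practice the pose pairs would come from a LiDAR odometry pipeline, and the resulting values would serve as regression targets in place of recorded steering commands.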