🤖 AI Summary
This work addresses the limitations of pure imitation learning—constrained by demonstration quality—and existing reinforcement learning (RL) fine-tuning approaches, which often suffer from policy drift and suboptimal performance in end-to-end autonomous driving. To overcome these issues, the authors propose PaIR-Drive, a novel framework that jointly optimizes imitation learning and RL in parallel during training to avoid objective conflicts, and leverages the imitation policy to guide RL trajectory generation at inference time. Key innovations include a parallel dual-branch architecture, a tree-structured trajectory neural sampler for enhanced exploration, and Group Relative Policy Optimization (GRPO), with support for plug-and-play integration of new imitation policies. Evaluated on NAVSIMv1 and NAVSIMv2, PaIR-Drive achieves 91.2 PDMS and 87.9 EPDMS, respectively, significantly outperforming prior RL fine-tuning methods and effectively correcting suboptimal human behaviors to produce high-quality driving trajectories.
📝 Abstract
End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond the prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler for group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIMv1 and NAVSIMv2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon the Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and can even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
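The core of the GRPO objective mentioned above is a group-relative advantage: a group of candidate trajectories is sampled, each is scored by a reward, and each trajectory's advantage is its reward normalized against the group's statistics, so no learned value function is needed. The sketch below is a minimal illustration of that normalization only; the sampler, reward (e.g. a PDMS-style score), and all names here are hypothetical and not taken from the paper.

```python
# Illustrative sketch of GRPO-style group-relative advantages.
# Assumption: rewards come from scoring a group of sampled candidate
# trajectories (e.g. from a tree-structured sampler); none of these
# names correspond to PaIR-Drive's actual implementation.
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four candidate trajectories scored by a hypothetical reward.
advantages = group_relative_advantages([0.91, 0.85, 0.88, 0.95])
```

Trajectories scoring above the group mean receive positive advantages and are reinforced; those below the mean are suppressed, which is how group-relative methods sidestep training a separate critic.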