🤖 AI Summary
This work addresses three challenges in high-level autonomous driving: motion planners that are not robust under closed-loop interaction, the difficulty of modeling multimodal future uncertainty, and the instability that arises when pure imitation learning provides no corrective negative feedback. The authors propose RAD-2, a generator-discriminator framework: a diffusion-based generator produces diverse trajectory candidates, and an RL-optimized discriminator reranks them by long-term driving quality, so sparse scalar rewards are never applied directly to the high-dimensional trajectory space. The approach introduces Temporally Consistent Group Relative Policy Optimization to alleviate the credit assignment problem, adds On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals, and uses BEV-Warp, a high-throughput simulation environment operating in Bird's-Eye View feature space, for scalable training. Compared with strong diffusion-based planners, the method reduces the collision rate by 56%, and real-world deployment shows improved perceived safety and driving smoothness in complex urban traffic.
📝 Abstract
High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
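The generate-then-rerank idea described in the abstract can be illustrated with a small sketch. This is not the RAD-2 implementation: `sample_candidates` is a hypothetical random stand-in for the diffusion generator, `score` is a hand-written stand-in for the learned discriminator, and `group_relative_advantages` shows only the plain group-relative normalization used by standard GRPO (reward minus group mean, divided by group standard deviation), not the temporally consistent variant, whose details the abstract does not give. All names, the 1-D speed-profile representation, and the numeric constants are assumptions made for illustration.

```python
import random
import statistics


def sample_candidates(num_candidates, horizon, rng):
    """Hypothetical stand-in for the diffusion generator.

    Returns random 1-D longitudinal speed profiles (m/s), one list per candidate.
    """
    return [
        [rng.uniform(0.0, 10.0) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def score(trajectory, speed_limit=8.0):
    """Hypothetical stand-in for the learned discriminator.

    Higher is better: penalises speed above the limit and step-to-step changes.
    """
    overspeed = sum(max(0.0, v - speed_limit) for v in trajectory)
    roughness = sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return -(overspeed + 0.1 * roughness)


def rerank(candidates, score_fn):
    """Pick the candidate that the scoring function rates highest."""
    return max(candidates, key=score_fn)


def group_relative_advantages(rewards, eps=1e-8):
    """Plain GRPO-style normalisation of rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_candidates(num_candidates=6, horizon=5, rng=rng)
    best = rerank(candidates, score)
    rewards = [score(c) for c in candidates]
    print("selected profile:", [round(v, 2) for v in best])
    print("advantages:", [round(a, 2) for a in group_relative_advantages(rewards)])
```

In this toy setup the scalar score is only ever used to compare a handful of candidates, which mirrors the abstract's point that the reward does not need to be applied directly to the full high-dimensional trajectory space.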