🤖 AI Summary
Traditional end-to-end autonomous driving models lack structured reasoning, which limits their generalization and robustness in complex scenarios. Existing vision-language models applied to driving typically rely on isolated modules and static supervision, so they cannot support multi-stage decision-making. This paper proposes AutoDriveRL, a unified training framework that formulates autonomous driving as structured reasoning over four sequential stages: perception, prediction, planning, and behavior. Each stage is cast as a vision-language question-answering (VQA) task and optimized with a task-specific reward model, providing fine-grained reinforcement signals at each reasoning stage. Within this framework, the authors train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. On a public benchmark, DriveRX outperforms GPT-4o in behavior reasoning and remains robust under complex or corrupted driving conditions; further analysis highlights the roles of vision encoder design and reward-guided reasoning compression. The AutoDriveRL framework and the DriveRX model will be released to support future research.
📝 Abstract
Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.
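The core formulation described above — each driving stage modeled as a VQA task and scored by its own reward model — can be sketched as follows. This is a minimal illustrative sketch, not the released AutoDriveRL code: all names (`VQASample`, `make_reward_models`, `score_batch`) are hypothetical, and a toy token-overlap F1 stands in for the learned, task-specific reward models the paper trains.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four sequential reasoning stages named in the paper.
STAGES = ["perception", "prediction", "planning", "behavior"]

@dataclass
class VQASample:
    stage: str      # which reasoning stage this QA pair belongs to
    question: str   # stage-specific question posed to the policy VLM
    answer: str     # candidate answer sampled from the policy VLM
    reference: str  # reference answer used by the reward model

def make_reward_models() -> Dict[str, Callable[[VQASample], float]]:
    """Build one reward model per stage (hypothetical stand-in: the paper
    uses learned, stage-specific reward models; here every stage shares a
    toy token-overlap F1 scorer purely for illustration)."""
    def overlap_reward(sample: VQASample) -> float:
        pred = set(sample.answer.lower().split())
        ref = set(sample.reference.lower().split())
        if not pred or not ref:
            return 0.0
        hit = len(pred & ref)
        precision, recall = hit / len(pred), hit / len(ref)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
    return {stage: overlap_reward for stage in STAGES}

def score_batch(samples: List[VQASample]) -> Dict[str, float]:
    """Average reward per stage: the fine-grained, stage-level signal
    that would drive the reinforcement update for the policy VLM."""
    rewards = make_reward_models()
    per_stage: Dict[str, List[float]] = {s: [] for s in STAGES}
    for sample in samples:
        per_stage[sample.stage].append(rewards[sample.stage](sample))
    return {s: sum(v) / len(v) for s, v in per_stage.items() if v}
```

The point of the decomposition is that a weak planning answer can be penalized by the planning reward model without diluting the signal for an accurate perception answer, which a single end-to-end scalar reward could not separate.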