🤖 AI Summary
End-to-end (E2E) autonomous driving suffers from decision failures in closed-loop evaluation due to limited causal reasoning and a misalignment between the semantic reasoning space of vision-language models (VLMs) and the numerical trajectory action space. To address this, we propose ORION, a holistic framework built from three core components: (1) QT-Former, which aggregates long-horizon historical visual context; (2) a Large Language Model (LLM) for scene-level semantic and causal reasoning; and (3) a generative planner for precise trajectory prediction. A cross-modal space alignment mechanism ties these together, enabling unified end-to-end joint optimization of visual question answering (VQA) and trajectory generation. Evaluated on Bench2Drive, our method achieves a 77.74 Driving Score (DS) and 54.62% Success Rate (SR), surpassing the state of the art by 14.28 DS and 19.61% SR and significantly narrowing the gap between high-level semantic understanding and low-level motion control.
📝 Abstract
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework built on vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to enable unified E2E optimization of both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
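The QT-Former → LLM → generative-planner data flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: all module internals, feature dimensions, and function names here are illustrative assumptions; each stage is reduced to a simple numpy projection just to show how long-horizon visual history is compressed into query tokens, mapped to a planning token in the shared reasoning/action space, and decoded into a numerical trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def qt_former(history_frames, num_queries=8):
    """Stand-in for QT-Former: aggregate long-horizon visual history
    into a fixed set of query tokens (here, a mean-pooled copy per query)."""
    pooled = history_frames.mean(axis=0)             # (feat_dim,)
    return np.tile(pooled, (num_queries, 1))         # (num_queries, feat_dim)

def llm_reasoning(query_tokens):
    """Stand-in for the LLM: map scene query tokens to a single
    'planning token' living in the shared reasoning/action space."""
    return query_tokens.mean(axis=0)                 # (feat_dim,)

def generative_planner(planning_token, horizon=6):
    """Stand-in for the generative planner: decode the planning token
    into a numerical trajectory of (x, y) waypoints."""
    w = rng.standard_normal((planning_token.size, horizon * 2)) * 0.01
    return (planning_token @ w).reshape(horizon, 2)  # (horizon, 2)

# Fake history of 10 frames, each a 64-d visual feature vector.
history = rng.standard_normal((10, 64))
tokens = qt_former(history)                # long-term context aggregation
plan_token = llm_reasoning(tokens)         # semantic/causal reasoning step
trajectory = generative_planner(plan_token)
print(trajectory.shape)  # (6, 2): six future (x, y) waypoints
```

Aligning the reasoning and action spaces, as the abstract describes, amounts to making the LLM's output token directly consumable by the planner, so VQA and planning losses can be optimized jointly end to end.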