🤖 AI Summary
This work addresses two limitations of existing end-to-end autonomous driving systems: they lack scene-specific visual reasoning mechanisms, and they struggle to model long-horizon future states. To overcome these limitations, we propose a long-horizon world model tailored for autonomous driving that operates in bird's-eye-view (BEV) space and predicts latent semantic features for consecutive future frames in parallel. The model further incorporates an adaptive textual reasoning module that integrates social commonsense knowledge to improve understanding of complex and long-tail driving scenarios. Evaluated on the closed-loop Bench2Drive benchmark, our approach achieves state-of-the-art performance, significantly improving decision robustness and generalization in challenging driving conditions.
📝 Abstract
End-to-end autonomous driving systems increasingly integrate Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains and lack in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. The resulting approach achieves state-of-the-art (SOTA) results on the closed-loop Bench2Drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.