🤖 AI Summary
This work addresses critical limitations of existing VLM-based end-to-end driving models, namely insufficient lane perception, inaccurate interpretation of language instructions, and poor robustness in corner cases. To overcome these challenges, we propose AppleVLM, a novel vision-language architecture that jointly enhances perception and planning by integrating multi-view spatiotemporal imagery with an explicit bird's-eye-view planning modality. Our approach introduces three key innovations: a deformable Transformer-based visual encoder, a tri-modal alignment mechanism bridging vision, language, and planning, and a hierarchical Chain-of-Thought fine-tuning strategy. Experimental results demonstrate that AppleVLM achieves state-of-the-art closed-loop performance on two CARLA benchmarks and enables end-to-end autonomous driving in complex outdoor environments on a real-world AGV platform.
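The hierarchical Chain-of-Thought fine-tuning strategy is only named here, not specified. A hypothetical training sample for such a scheme might stage the reasoning from perception to planning roughly as in the sketch below; the field names, reasoning levels, and values are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical structure of one hierarchical Chain-of-Thought fine-tuning sample:
# reasoning proceeds from scene perception to prediction to a final waypoint plan.
# All field names and stages are illustrative assumptions.
sample = {
    "instruction": "Turn left at the next intersection and yield to pedestrians.",
    "chain_of_thought": [
        {"level": "perception", "text": "Two lanes ahead; a pedestrian waits at the left crosswalk."},
        {"level": "prediction", "text": "The pedestrian is likely to cross when the light changes."},
        {"level": "planning",   "text": "Slow down, wait for the crosswalk to clear, then turn left."},
    ],
    "waypoints": [(0.0, 2.1), (0.3, 4.0), (1.1, 5.6), (2.4, 6.8)],  # ego-frame (x, y) in metres
}
```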
📝 Abstract
End-to-end autonomous driving has emerged as a promising paradigm that integrates perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulty handling corner cases. To address these issues, we propose AppleVLM, a perception- and planning-enhanced VLM for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. First, the vision encoder fuses spatio-temporal information from multi-view images across multiple timesteps using a deformable-transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Second, unlike conventional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating the biases of language-only navigation instructions. Finally, a VLM decoder fine-tuned with a hierarchical Chain-of-Thought strategy integrates the vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, where it achieves state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and demonstrate real-world end-to-end autonomous driving in complex outdoor environments.
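To make the three-part design described above concrete, the sketch below mocks up the data flow with plain PyTorch. All class names, tensor shapes, layer counts, and the use of standard multi-head attention in place of the paper's deformable-transformer fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical end-to-end sketch of the three components described in the abstract,
# written with plain PyTorch. Standard attention stands in for the paper's
# deformable-transformer fusion; all names and dimensions are assumed.
import torch
import torch.nn as nn


class MultiViewTemporalEncoder(nn.Module):
    """Fuses image tokens from several cameras and timesteps into scene features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, view_tokens):
        # view_tokens: (batch, views * timesteps * tokens_per_view, dim)
        return self.fuse(view_tokens)


class BEVPlanningEncoder(nn.Module):
    """Encodes an explicit bird's-eye-view planning raster into planning tokens."""

    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )

    def forward(self, bev):
        # bev: (batch, 3, H, W) rasterized route / drivable-area map
        feat = self.conv(bev)                   # (batch, dim, h, w)
        return feat.flatten(2).transpose(1, 2)  # (batch, h * w, dim)


class WaypointDecoder(nn.Module):
    """Cross-attends vision, language, and planning tokens to predict waypoints."""

    def __init__(self, dim=256, heads=8, num_waypoints=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_waypoints, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)  # (x, y) offset per waypoint

    def forward(self, vision_tok, lang_tok, plan_tok):
        memory = torch.cat([vision_tok, lang_tok, plan_tok], dim=1)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.head(self.decoder(queries, memory))


# Toy forward pass with random tensors standing in for camera, text, and BEV inputs.
vision = MultiViewTemporalEncoder()(torch.randn(1, 6 * 2 * 16, 256))  # e.g. 6 views, 2 steps
plan = BEVPlanningEncoder()(torch.randn(1, 3, 128, 128))
lang = torch.randn(1, 20, 256)  # placeholder for instruction embeddings from the VLM
waypoints = WaypointDecoder()(vision, lang, plan)
print(waypoints.shape)  # torch.Size([1, 4, 2])
```

In this sketch the language tokens are random placeholders; in the described system they would come from the VLM backbone, and the waypoint queries attend jointly over the concatenated vision, language, and planning tokens before regressing the driving waypoints.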