ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving

šŸ“… 2025-05-21
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ“„ PDF
šŸ¤– AI Summary
Existing end-to-end autonomous driving systems struggle to jointly optimize visual perception performance and natural language understanding. This paper proposes ALN-P3, a unified cross-modal co-distillation framework that introduces a three-tier alignment mechanism: (1) Perception Alignment (P1A), (2) Prediction Alignment (P2A), and (3) Planning Alignment (P3A). These modules align visual tokens with their corresponding linguistic outputs across the full perception–prediction–planning stack during training, while incurring zero inference latency and zero parameter overhead. Leveraging collaborative distillation between large language models and vision backbones, the method enhances both driving decision-making and language comprehension, achieving state-of-the-art results on four benchmarks: nuScenes, Nu-X, TOD3Cap, and nuScenes QA.

šŸ“ Abstract
Recent advances have explored integrating large language models (LLMs) into end-to-end autonomous driving systems to enhance generalization and interpretability. However, most existing approaches are limited to either driving performance or vision-language reasoning, making it difficult to achieve both simultaneously. In this paper, we propose ALN-P3, a unified co-distillation framework that introduces cross-modal alignment between "fast" vision-based autonomous driving systems and "slow" language-driven reasoning modules. ALN-P3 incorporates three novel alignment mechanisms: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), which explicitly align visual tokens with corresponding linguistic outputs across the full perception, prediction, and planning stack. All alignment modules are applied only during training and incur no additional costs during inference. Extensive experiments on four challenging benchmarks (nuScenes, Nu-X, TOD3Cap, and nuScenes QA) demonstrate that ALN-P3 significantly improves both driving decisions and language reasoning, achieving state-of-the-art results.
Problem

Research questions and friction points this paper is trying to address.

Integrating LLMs into autonomous driving for better generalization and interpretability
Balancing driving performance and vision-language reasoning simultaneously
Aligning visual tokens with linguistic outputs across perception, prediction, planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified co-distillation framework for autonomous driving
Cross-modal alignment between vision and language modules
Three novel alignment mechanisms for perception, prediction, planning
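The paper does not reproduce its loss formulation here, but the core idea of the alignment mechanisms, pulling paired visual and language token embeddings together with a training-only objective, can be illustrated with a minimal sketch. The function name `token_alignment_loss` and the cosine-similarity formulation are assumptions for illustration, not the authors' exact method:

```python
import numpy as np

def token_alignment_loss(visual_tokens: np.ndarray, text_tokens: np.ndarray) -> float:
    """Hypothetical token-level alignment loss.

    Computes 1 minus the mean cosine similarity between paired visual and
    language token embeddings (both of shape [N, D]). A loss like this is
    added only to the training objective, so inference latency and the
    parameter count of the deployed driving stack are unchanged.
    """
    # L2-normalize each token embedding so the dot product is a cosine.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    cos = np.sum(v * t, axis=1)        # per-token-pair cosine similarity
    return float(1.0 - cos.mean())     # 0 when the two modalities coincide
```

In a framework like ALN-P3, one such term could be attached at each stage (perception, prediction, planning) and summed into the total training loss; at inference time the language branch and the loss simply drop away.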