From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

📅 2026-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the trade-off between accuracy and computational cost in end-to-end autonomous driving by proposing two novel architectures, HybridDriveVLA and DualDriveVLA. These frameworks deploy a vision-language model (VLM) and a Vision Transformer (ViT) in parallel, leveraging their complementary strengths in handling long-tail driving scenarios. A confidence-based trajectory scorer and a fast-slow policy scheduling mechanism enable dynamic collaborative inference, adaptively allocating computational resources. Experimental results demonstrate that HybridDriveVLA achieves a PDMS score of 92.10, while DualDriveVLA attains 91.00 PDMS by invoking the VLM in only 15% of scenarios, yielding a 3.2× improvement in inference throughput.

📝 Abstract
Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a three-research-question (RQ) analysis in RecogDrive, instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM introduces additional representational subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behaviors in some long-tail scenarios: the VLM tends to be more aggressive whereas the ViT is more conservative, and each decisively wins on about 2--3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To harness this observation, we propose HybridDriveVLA, which runs both the ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs the ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
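The fast-slow scheduling described in the abstract can be sketched as a simple confidence-gated dispatch. This is a minimal illustration only: the paper does not specify its scorer or planner interfaces, so `vit_planner`, `vlm_planner`, `scorer`, and the threshold value below are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A trajectory is sketched here as a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


@dataclass
class DualDrivePolicy:
    """Sketch of a confidence-gated fast-slow policy (interfaces assumed)."""
    vit_planner: Callable[[object], Trajectory]    # fast, vision-only branch
    vlm_planner: Callable[[object], Trajectory]    # slow, language-enabled branch
    scorer: Callable[[object, Trajectory], float]  # learned confidence in [0, 1]
    threshold: float = 0.5                         # confidence gate (assumed value)

    def plan(self, scene: object) -> Trajectory:
        # Fast path: run the ViT branch by default.
        fast_traj = self.vit_planner(scene)
        conf = self.scorer(scene, fast_traj)
        if conf >= self.threshold:
            return fast_traj
        # Slow path: confidence is low, so invoke the VLM branch and
        # keep whichever endpoint trajectory the scorer prefers.
        slow_traj = self.vlm_planner(scene)
        if self.scorer(scene, slow_traj) > conf:
            return slow_traj
        return fast_traj
```

Under this gating, the expensive VLM branch only runs on the low-confidence fraction of scenarios (the abstract reports 15%), which is where the claimed 3.2x throughput gain comes from.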
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
End-to-End Driving
Representational Complementarity
Long-tail Scenarios
Autonomous Driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
End-to-End Driving
Representational Complementarity
Dual-System Architecture
Trajectory Selection