BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

📅 2026-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of current LLM-based autonomous driving systems, which process multi-view images independently, leading to computational redundancy and spatial inconsistency that hinder accurate 3D reasoning. While bird's-eye-view (BEV) representations offer strong geometric structure, they often lack rich semantic understanding. To bridge this gap, the authors propose BEVLM, a framework that, for the first time, enables bidirectional fusion between LLMs and BEV representations: unified BEV features are fed into the LLM to enhance its spatial reasoning, while the LLM's semantic knowledge is distilled back into the BEV representation to enrich its semantic content. Integrating a geometry-pretrained BEV encoder, a semantic distillation mechanism, and an end-to-end driving architecture, BEVLM improves LLM reasoning accuracy by 46% in cross-view scenarios and boosts closed-loop driving performance by 29% in safety-critical situations.
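The "distilled back into the BEV representation" direction can be pictured as a feature-alignment objective. The sketch below is a minimal, hypothetical illustration (not BEVLM's actual implementation): BEV cell features are linearly projected into the LLM's embedding space and pulled toward the LLM's semantic features with a mean cosine-distance loss. All names, shapes, and the choice of cosine distance are assumptions for illustration.

```python
# Hypothetical sketch of LLM-to-BEV semantic distillation: project BEV
# features into the LLM embedding space, then minimize cosine distance.
# Shapes and the projection are illustrative assumptions, not BEVLM's API.
import numpy as np

def cosine_distill_loss(bev_feats, llm_feats, proj):
    """Mean (1 - cosine similarity) between projected BEV and LLM features.

    bev_feats: (N, d_bev) per-cell BEV features
    llm_feats: (N, d_llm) LLM semantic features aligned to the same cells
    proj:      (d_bev, d_llm) learned linear projection (assumed)
    """
    z = bev_feats @ proj                                   # into LLM space
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    t = llm_feats / (np.linalg.norm(llm_feats, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(z * t, axis=1)))

rng = np.random.default_rng(0)
bev = rng.standard_normal((4, 8))    # 4 BEV cells, 8-dim geometric features
llm = rng.standard_normal((4, 16))   # matching 16-dim LLM semantic features
W = rng.standard_normal((8, 16))     # projection to be learned
loss = cosine_distill_loss(bev, llm, W)
```

In a real training loop this term would be added to the driving objectives, with gradients updating the BEV encoder and projection; here it is computed once to show the mechanics.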

📝 Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Bird's-Eye View
autonomous driving
spatial consistency
semantic representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BEV representation
LLM distillation
spatial consistency
semantic knowledge transfer
autonomous driving