BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of interpretability and the gaps in evaluation benchmarks for 3D scene description in autonomous driving, this paper proposes a lightweight multimodal BEV-based framework. Methodologically, it uses BEVFusion to combine LiDAR point clouds and multi-view imagery into bird's-eye-view (BEV) features, incorporates a novel view-specific absolute positional encoding, and feeds the result to a 1B-parameter language model for cross-modal alignment and natural language generation. Key contributions include: (1) BEV-LLM, a lightweight captioning model that surpasses state-of-the-art BLEU scores on the nuCaption dataset by up to 5% despite its small base model; (2) two new benchmark datasets, nuView (environmental conditions and viewpoints) and GroundView (object grounding), designed to close gaps in existing scene-captioning benchmarks; and (3) initial benchmarking results on both datasets, demonstrating their usefulness for evaluating captioning across diverse driving scenarios.
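
As a rough illustration of the pipeline described above, the sketch below projects a fused BEV feature map into the token space of a small language model. Every module name and dimension here (`BEVToLLMProjector`, `bev_channels`, `lm_dim`) is an assumption for illustration; the paper's actual architecture is not specified in this summary.

```python
# Illustrative sketch only: projecting fused BEV features into an LLM's
# embedding space. Module names and dimensions are assumptions, not the
# paper's implementation.
import torch
import torch.nn as nn

class BEVToLLMProjector(nn.Module):
    """Flattens a BEV feature map into prefix tokens for a small LM."""
    def __init__(self, bev_channels: int = 256, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(bev_channels, lm_dim)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (batch, C, H, W), e.g. from a LiDAR + multi-view camera
        # fusion backbone such as BEVFusion.
        tokens = bev.flatten(2).transpose(1, 2)  # (batch, H*W, C)
        return self.proj(tokens)                 # (batch, H*W, lm_dim)

# The projected tokens would be prepended to the text prompt embeddings
# of a ~1B-parameter language model, which then decodes the caption.
bev = torch.randn(2, 256, 180, 180)
print(BEVToLLMProjector()(bev).shape)  # torch.Size([2, 32400, 2048])
```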

📝 Abstract
Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
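
For context on the BLEU figures quoted above, the snippet below scores hypothetical captions against references with the sacrebleu package. This is a generic metric illustration, not the authors' evaluation code, and the example sentences are invented.

```python
# Generic BLEU scoring example with sacrebleu; the paper's exact
# evaluation protocol (tokenization, n-gram weights) may differ.
import sacrebleu

hypotheses = ["a car drives along a wet road at night"]
# references[k][i] is the k-th reference for the i-th hypothesis.
references = [["a vehicle is driving down a rain-soaked street at night"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```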
Problem

Research questions and friction points this paper is trying to address.

Enhancing autonomous driving transparency through natural language scene captioning
Combining LiDAR and multi-view images into 3D driving scene descriptions
Closing benchmark gaps for evaluating captioning across diverse driving scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses LiDAR point clouds and multi-view images via BEVFusion
Adds a view-specific absolute positional encoding for view-dependent descriptions (see the sketch after this list)
Achieves competitive captioning performance with only a 1B-parameter base model
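
The summary does not spell out the view-specific absolute positional encoding, so the following is one plausible reading, offered as a sketch: standard sinusoidal absolute positions over the BEV token sequence plus a learned per-camera-view offset (nuScenes-style setups have six cameras). All names and shapes here are assumptions, not the paper's formulation.

```python
# Sketch of a "view-specific absolute positional encoding"; one plausible
# reading, not the paper's formulation. Assumes an even embedding dim.
import math
import torch
import torch.nn as nn

class ViewAbsolutePE(nn.Module):
    """Sinusoidal absolute positions plus a learned per-view offset."""
    def __init__(self, dim: int = 256, max_len: int = 4096, num_views: int = 6):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                  # (max_len, dim)
        self.view_embed = nn.Embedding(num_views, dim)  # one offset per camera

    def forward(self, tokens: torch.Tensor, view_idx: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); view_idx: (batch,) camera index
        seq = tokens.size(1)
        return tokens + self.pe[:seq] + self.view_embed(view_idx).unsqueeze(1)
```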