🤖 AI Summary
To address the lack of interpretability and standardized evaluation benchmarks for 3D scene description in autonomous driving, this paper proposes a lightweight multimodal BEV-based framework. Methodologically, it fuses LiDAR point clouds and multi-view imagery to construct bird's-eye-view (BEV) features, incorporates a view-specific absolute positional encoding, and leverages a 1B-parameter multimodal large language model for cross-modal alignment and natural language generation. Key contributions include: (1) introducing nuView and GroundView, two new benchmark datasets designed for autonomous-driving-oriented 3D scene description; (2) surpassing state-of-the-art methods by up to 5% in BLEU scores on the nuCaption dataset; and (3) establishing an evaluation setting that jointly assesses accuracy, efficiency, and interpretability, thereby providing a more reliable benchmark for scene captioning in complex driving scenarios.
📝 Abstract
Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
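To make the idea of a view-specific absolute positional encoding concrete, here is a minimal sketch, not the paper's actual implementation: it uses a standard transformer-style sinusoidal encoding keyed by camera-view index and broadcasts it over a BEV feature map. All function names, shapes, and the six-camera rig are assumptions for illustration.

```python
import numpy as np

def view_positional_encoding(view_idx: int, dim: int) -> np.ndarray:
    """Hypothetical sketch: a fixed sinusoidal embedding for one camera view.

    Each view index gets a distinct, deterministic vector so a downstream
    language model can tell which view a BEV feature came from.
    Assumes an even embedding dimension `dim`.
    """
    # Transformer-style frequency ladder over the embedding dimension.
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    enc = np.zeros(dim)
    enc[0::2] = np.sin(view_idx * div)
    enc[1::2] = np.cos(view_idx * div)
    return enc

def tag_bev_features(bev_feats: np.ndarray, view_idx: int) -> np.ndarray:
    """Add the view encoding to every spatial cell of an (H, W, C) BEV map."""
    enc = view_positional_encoding(view_idx, bev_feats.shape[-1])
    return bev_feats + enc  # broadcasts over H and W

# Example: a nuScenes-style rig with 6 cameras, 200x200 BEV grid, 64 channels.
feats = np.zeros((200, 200, 64))
tagged = tag_bev_features(feats, view_idx=2)
```

The point of the sketch is only that the encoding is *absolute* (tied to the view index, not learned per scene), so descriptions can be conditioned on a specific viewpoint; the real model's encoding design may differ.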