🤖 AI Summary
Existing evaluations of multimodal large language models (MLLMs) for autonomous driving lack a systematic assessment of scene-understanding capabilities. Method: This paper proposes a capability-driven, holistic framework for evaluating scene understanding, grounded in three pillars: the requirements of autonomous driving systems, human driver cognition, and language-based reasoning. It introduces four core capability dimensions (semantic, spatial, temporal, and physical) and further organises the domain into context layers, processing modalities, and downstream tasks. Contribution/Results: The framework's applicability is illustrated on two real-world traffic scenarios, grounding the proposed dimensions in realistic driving situations. It provides a structured, theoretically grounded foundation for evaluating MLLMs' scenario-understanding potential in autonomous driving systems.
📝 Abstract
Multimodal large language models (MLLMs) hold the potential to enhance autonomous driving by combining domain-independent world knowledge with context-specific language guidance. Their integration into autonomous driving systems has shown promising results in isolated proof-of-concept applications, yet their performance has so far been evaluated only on selected individual aspects of perception, reasoning, or planning. To leverage their full potential, a systematic framework for evaluating MLLMs in the context of autonomous driving is required. This paper proposes a holistic framework for the capability-driven evaluation of MLLMs in autonomous driving. The framework structures scenario understanding along four core capability dimensions: semantic, spatial, temporal, and physical. These are derived from the general requirements of autonomous driving systems, human driver cognition, and language-based reasoning. It further organises the domain into context layers, processing modalities, and downstream tasks such as language-based interaction and decision-making. To illustrate the framework's applicability, two exemplary traffic scenarios are analysed, grounding the proposed dimensions in realistic driving situations. The framework thus provides a foundation for the structured evaluation of MLLMs' potential for scenario understanding in autonomous driving.
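To make the dimensional structure concrete, a capability-driven evaluation could be organised roughly as sketched below. All names here (`Capability`, `ScenarioEvaluation`, the scenario identifier) are illustrative assumptions for this summary, not the paper's actual API or terminology beyond the four named dimensions:

```python
from dataclasses import dataclass, field
from enum import Enum

# The four core capability dimensions named in the abstract.
class Capability(Enum):
    SEMANTIC = "semantic"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    PHYSICAL = "physical"

@dataclass
class ScenarioEvaluation:
    """Hypothetical container: one MLLM evaluation of one traffic scenario."""
    scenario_id: str
    # Per-dimension scores, e.g. normalised to [0.0, 1.0] (assumed scale).
    scores: dict = field(default_factory=dict)

    def record(self, capability: Capability, score: float) -> None:
        self.scores[capability] = score

    def covered(self) -> bool:
        # A holistic evaluation touches all four capability dimensions.
        return set(self.scores) == set(Capability)

# Usage: score a hypothetical urban-intersection scenario on every dimension.
ev = ScenarioEvaluation("urban-intersection-01")
for cap in Capability:
    ev.record(cap, 0.8)
print(ev.covered())  # → True
```

The point of such a structure is only that coverage of all four dimensions becomes checkable, rather than evaluating isolated aspects of perception, reasoning, or planning.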