DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing autonomous driving benchmarks: they focus predominantly on individual traffic rules and so cannot evaluate whether models truly comprehend scenarios in which rules apply concurrently or conflict. To this end, the authors propose DriveCombo, a multimodal benchmark for compositional traffic rule reasoning built around a five-level cognitive evaluation framework. Its scenarios are produced by Rule2Scene, an agent that automatically translates linguistic rules into dynamic driving scenes. Combining rule-driven scenario generation with hierarchical cognitive assessment, the benchmark was used to evaluate 14 mainstream MLLMs. Results reveal that performance drops as task complexity grows, particularly under rule conflicts, while fine-tuning on DriveCombo improves both rule reasoning and downstream planning capabilities.

📝 Abstract

Multimodal Large Language Models (MLLMs) are rapidly becoming the brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios such as traffic sign recognition, neglecting the multi-rule concurrency and conflicts that arise in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in complex real-world situations. To bridge this gap, we propose DriveCombo, a text- and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we introduce a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level visual reasoning about traffic rules. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
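
The "quantitative assessment across cognitive stages" described above can be pictured as a per-level accuracy breakdown. The following is only a minimal sketch under stated assumptions, not the authors' evaluation code: it assumes each benchmark item carries an integer ladder level from 1 to 5 and a multiple-choice answer, and the function name `accuracy_by_level` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return {level: accuracy} for (level, predicted, expected) records.

    Assumption: each record is tagged with an integer ladder level (1-5)
    and answers are compared by exact match, as in multiple-choice QA.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, expected in records:
        total[level] += 1
        if predicted == expected:
            correct[level] += 1
    # Report levels in ascending order so a drop with complexity is easy to read.
    return {level: correct[level] / total[level] for level in sorted(total)}


# Hypothetical results, invented for illustration only:
# (ladder level, model answer, gold answer)
sample_records = [
    (1, "A", "A"), (1, "B", "B"),
    (3, "C", "C"), (3, "A", "D"),
    (5, "B", "A"), (5, "C", "C"), (5, "D", "A"), (5, "A", "B"),
]

per_level = accuracy_by_level(sample_records)
for level, acc in per_level.items():
    print(f"Level {level}: {acc:.2f}")
```

Grouping by level rather than reporting a single overall score is what makes a decline from single-rule items toward conflict-resolution items visible.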

Problem

Research questions and friction points this paper is trying to address.

compositional reasoning

traffic rule understanding

multimodal large language models

rule conflict

autonomous driving benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional reasoning

traffic rule benchmark

multimodal large language models