🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic evaluation of spatiotemporal traffic behavior understanding, particularly under autonomous driving requirements. To address this gap, we introduce TB-Bench, an MLLM benchmark tailored to autonomous driving that comprises eight ego-centric spatiotemporal perception tasks. We further construct two vision-language instruction-tuning datasets, TB-100k and TB-250k, using a task-driven approach to ego-centric multimodal instruction data curation. Additionally, we design a lightweight baseline model that combines spatiotemporal modeling, vision-language alignment, and an efficient vision-encoder-to-LLM adaptation architecture. Experiments show that the fine-tuned baselines reach an average accuracy of up to 85%, substantially outperforming GPT-4o (below 35%). Moreover, co-training TB-100k with another traffic dataset transfers knowledge and improves performance on that dataset.
📝 Abstract
The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce two vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform on these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve an average accuracy of up to 85%. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, which improves performance on the latter. Overall, this study contributes a comprehensive benchmark, high-quality datasets, and baselines, supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.