TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos

📅 2025-01-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) lack systematic evaluation of spatiotemporal traffic behavior understanding, particularly under autonomous driving requirements. To address this gap, we introduce TB-Bench, an MLLM benchmark tailored for autonomous driving that comprises eight ego-centric spatiotemporal perception tasks. We further construct two vision-language instruction-tuning datasets, TB-100k and TB-250k, curated around these tasks. Additionally, we design a lightweight baseline model that integrates spatiotemporal modeling, vision-language alignment, and an efficient adapter between the vision encoder and the LLM. Experiments show that baselines fine-tuned on TB-100k or TB-250k reach an average accuracy of up to 85%, substantially outperforming GPT-4o (below 35% on average). Moreover, co-training TB-100k with another traffic dataset transfers knowledge and improves performance on that dataset.

📝 Abstract

The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.
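The "average accuracy" figures quoted above can be read as a mean over per-task accuracies. The snippet below is a minimal sketch of that reading, not the authors' evaluation code: the record format, the case-insensitive exact-match rule, the placeholder task names (`task_a`, `task_b`), and the choice of an unweighted mean across tasks are all assumptions made for illustration.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per task.

    `records` is an iterable of (task, predicted, expected) string tuples.
    This format and the exact-match rule are illustrative assumptions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, expected in records:
        total[task] += 1
        # Case-insensitive exact match after trimming whitespace (assumed rule).
        if predicted.strip().lower() == expected.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(accuracies):
    """Unweighted mean across tasks, so each task counts equally."""
    return sum(accuracies.values()) / len(accuracies)


# Hypothetical records; the task names are placeholders, not TB-Bench task names.
records = [
    ("task_a", "left", "left"),
    ("task_a", "right", "left"),
    ("task_b", "yes", "Yes"),
]
scores = per_task_accuracy(records)
print(scores, macro_average(scores))
```

An unweighted mean keeps a task with many samples from dominating the headline number. If the benchmark instead averages over all samples, the two figures differ whenever task sizes are unequal.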

Problem

Research questions and friction points this paper is trying to address.

Multimodal Language Models

Temporal and Spatial Understanding

Autonomous Vehicles Application

Innovation

Methods, ideas, or system contributions that make the work stand out.

TB-Bench

Multi-modal Large Language Models

Autonomous Driving
