🤖 AI Summary
Existing LMM benchmarks are limited to single-view, region-level tasks and fail to reflect models' true capabilities in complex urban scenes. This paper introduces UrBench, the first comprehensive benchmark for multi-view urban scene evaluation, covering 14 tasks across four dimensions: geo-localization, scene reasoning, scene understanding, and object understanding, for a total of 11.6K high-quality region-level and role-level questions. Methodologically, the authors combine field-collected imagery from 11 cities with public datasets; questions are generated by LMMs, filtered with rule-based validation, and finalized through human expert annotation, backed by multi-source labeling and multi-strategy verification. Key contributions include: (1) the first systematic evaluation of LMMs' cross-view holistic reasoning; (2) a novel cross-view detection-matching annotation method; and (3) role-level tasks that go beyond conventional region-level benchmarks. Experiments on 21 state-of-the-art LMMs reveal that even GPT-4o underperforms humans by 17.4% on average, with widespread deficiencies in cross-view relational consistency, precise localization, and attribute recognition.
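The summary's three-stage question pipeline (LMM generation, rule-based validation, human finalization) can be pictured concretely. Below is a minimal sketch of that flow; all names (`Question`, `rule_checks`, `build_benchmark`, and the callbacks) are illustrative assumptions, not code from the UrBench release.

```python
# Hypothetical sketch of the three-stage pipeline: an LMM drafts candidate
# questions, cheap rule-based checks filter them, and survivors go to a
# human annotator for final approval.
from dataclasses import dataclass

@dataclass
class Question:
    image_ids: list[str]   # one or more views of the same urban scene
    task: str              # e.g. "Geo-Localization", "Scene Reasoning"
    text: str
    choices: list[str]
    answer: str

def rule_checks(q: Question) -> bool:
    """Automatic filters applied before any human effort is spent."""
    has_answer = q.answer in q.choices
    unique_choices = len(set(q.choices)) == len(q.choices)
    non_trivial = len(q.text.split()) >= 5
    return has_answer and unique_choices and non_trivial

def build_benchmark(scenes, draft_with_lmm, human_review):
    """draft_with_lmm(scene) yields candidate Questions; human_review
    returns a corrected Question or None to reject."""
    accepted = []
    for scene in scenes:
        for draft in draft_with_lmm(scene):   # LMM-based generation
            if not rule_checks(draft):        # rule-based validation
                continue
            final = human_review(draft)       # human-based refinement
            if final is not None:
                accepted.append(final)
    return accepted
```

The ordering matters: rule-based rejection is nearly free, so placing it between the LMM and the annotators keeps human review focused on plausible candidates.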
📝 Abstract
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs on basic region-level urban tasks under a single view, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both the region level and the role level, covering 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale, high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several respects. Even the best-performing GPT-4o lags behind humans on most tasks, from simple ones such as counting to complex ones such as orientation, localization, and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors across different urban views, especially with respect to understanding cross-view relations.
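The abstract's cross-view detection-matching annotation can be illustrated with a small sketch. The idea, under stated assumptions rather than the authors' actual implementation, is to project detections from two views of the same scene into shared geographic coordinates and greedily pair nearest neighbours within a distance threshold; `to_geo` and `max_dist_m` are hypothetical names.

```python
# Minimal sketch (an assumption, not the UrBench code) of cross-view
# detection matching: link detections from two views of one scene by
# nearest-neighbour distance in geographic coordinates.
import math

def pair_detections(view_a, view_b, to_geo, max_dist_m=5.0):
    """view_a / view_b: lists of per-view detections.
    to_geo(det) -> (lat, lon). Returns matched index pairs (i, j)."""
    def dist_m(p, q):
        # Equirectangular approximation; adequate at city scale.
        lat = math.radians((p[0] + q[0]) / 2)
        dx = math.radians(q[1] - p[1]) * math.cos(lat) * 6_371_000
        dy = math.radians(q[0] - p[0]) * 6_371_000
        return math.hypot(dx, dy)

    geo_a = [to_geo(d) for d in view_a]
    geo_b = [to_geo(d) for d in view_b]
    used_b, pairs = set(), []
    for i, pa in enumerate(geo_a):
        best = min(
            ((dist_m(pa, pb), j) for j, pb in enumerate(geo_b) if j not in used_b),
            default=None,
        )
        if best and best[0] <= max_dist_m:
            pairs.append((i, best[1]))
            used_b.add(best[1])
    return pairs
```

A greedy one-to-one assignment like this trades optimality for simplicity; a production pipeline might instead solve the full assignment problem (e.g. Hungarian matching) when views contain many nearby objects.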