TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images

📅 2025-04-01
🤖 AI Summary
Existing vision-language models (VLMs) are predominantly evaluated on frontal-view images; systematic assessment of top-down view understanding is severely lacking, primarily due to the absence of high-quality, diverse top-down image datasets. Method: We introduce TDBench, the first comprehensive benchmark for top-down image understanding, integrating real-world datasets (AID, DOTA) and synthetic scenes rendered in Unreal Engine. It defines tasks across ten dimensions of image understanding and four realistic application scenarios (e.g., navigation, aerial imagery analysis), and establishes the first multi-dimensional evaluation framework for this setting via structured visual question answering and standardized zero-shot/fine-tuning protocols. Contribution/Results: Comprehensive evaluation of mainstream VLMs reveals an average 23.6% accuracy drop on top-down tasks compared to frontal-view counterparts, exposing critical bottlenecks in spatial relation modeling, scale invariance, and contextual aggregation.

📝 Abstract
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, covering diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs spanning ten evaluation dimensions of image understanding. Moreover, we conduct four case studies on scenarios that commonly arise in real-world applications but remain underexplored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
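The multi-dimensional evaluation described above boils down to aggregating exact-match accuracy over question-answer pairs, grouped by evaluation dimension. Below is a minimal sketch of that aggregation; the record schema (`dimension`, `prediction`, `answer`) and the dimension names are illustrative assumptions, not TDBench's actual data format.

```python
from collections import defaultdict

# Hypothetical QA records: each holds the evaluation dimension, the
# model's predicted answer, and the ground-truth answer. The field
# names and dimension labels here are assumptions for illustration.
records = [
    {"dimension": "spatial_relation", "prediction": "left", "answer": "left"},
    {"dimension": "spatial_relation", "prediction": "right", "answer": "left"},
    {"dimension": "object_counting", "prediction": "3", "answer": "3"},
]

def per_dimension_accuracy(records):
    """Aggregate exact-match accuracy separately for each dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"] == r["answer"])
    return {dim: correct[dim] / total[dim] for dim in total}

print(per_dimension_accuracy(records))
```

Reporting a per-dimension breakdown rather than a single aggregate score is what lets a benchmark like this localize failures (e.g., spatial relations vs. counting) instead of only ranking models.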
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs for top-down image understanding
Addressing scarcity of diverse top-down datasets
Assessing VLMs in spatial and contextual tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing TDBench for top-down image evaluation
Combining real and synthetic top-down datasets
Assessing VLMs across ten understanding dimensions
Kaiyuan Hou
Department of Electrical Engineering, Columbia University
Minghui Zhao
Department of Electrical Engineering, Columbia University
Lilin Xu
Department of Electrical Engineering, Columbia University
Yuang Fan
PhD Student, Columbia University
Xiaofan Jiang
Associate Professor of Electrical Engineering, Columbia University
Mobile and Embedded Systems · Artificial Intelligence of Things · Smart Health and Fitness · CPHS