🤖 AI Summary
Existing vision-language models (VLMs) are evaluated almost exclusively on frontal-view images; systematic assessment of top-down view understanding is largely absent, mainly because high-quality, diverse top-down image datasets are scarce. Method: We introduce TDBench, the first comprehensive benchmark for top-down image understanding, combining real-world datasets (AID, DOTA) with synthetic scenes rendered in Unreal Engine. It defines ten evaluation dimensions of image understanding and four realistic application scenarios (e.g., navigation, aerial imagery analysis), and establishes the first multi-dimensional evaluation framework built on structured visual question answering with standardized zero-shot and fine-tuning protocols. Contribution/Results: A comprehensive evaluation of mainstream VLMs reveals an average 23.6% accuracy drop on top-down tasks relative to frontal-view counterparts, exposing critical bottlenecks in spatial relation modeling, scale invariance, and contextual aggregation.
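To make the headline number concrete, here is a minimal sketch of how such an average frontal-view vs. top-down accuracy drop could be computed across models. All model names and scores below are made-up placeholders for illustration, not TDBench's published results.

```python
# Hedged sketch: averaging the frontal-view vs. top-down accuracy gap
# across models. All numbers are illustrative placeholders, not
# TDBench's published results.

def mean_accuracy_drop(frontal: dict[str, float],
                       topdown: dict[str, float]) -> float:
    """Average accuracy drop (absolute points) over models scored in both views."""
    models = frontal.keys() & topdown.keys()  # models evaluated in both settings
    return sum(frontal[m] - topdown[m] for m in models) / len(models)

# Made-up example scores (fraction of VQA questions answered correctly):
frontal = {"vlm_a": 0.81, "vlm_b": 0.74, "vlm_c": 0.69}
topdown = {"vlm_a": 0.58, "vlm_b": 0.51, "vlm_c": 0.44}
print(f"average drop: {mean_accuracy_drop(frontal, topdown):.1%}")
```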
📝 Abstract
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, covering diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies of scenarios that commonly arise in real-world applications but remain underexplored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
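To give a feel for the evaluation setup, below is a minimal sketch of a zero-shot evaluation loop over TDBench-style question-answer pairs, scored per evaluation dimension. The JSON record layout, the field names, and the `ask_vlm` stub are assumptions for illustration, not TDBench's actual schema or harness; see the project homepage for the real data format.

```python
# Minimal sketch of zero-shot, per-dimension VQA scoring. The record
# layout ("image", "question", "choices", "answer", "dimension") and
# the ask_vlm() stub are hypothetical, not TDBench's actual API.
import json
from collections import defaultdict

def ask_vlm(image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder: call your VLM and return one of `choices`."""
    raise NotImplementedError("plug in a real VLM client here")

def evaluate(benchmark_json: str) -> dict[str, float]:
    with open(benchmark_json) as f:
        samples = json.load(f)  # assumed: a flat list of VQA records

    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        dim = s["dimension"]  # one of the ten evaluation dimensions
        pred = ask_vlm(s["image"], s["question"], s["choices"])
        correct[dim] += int(pred == s["answer"])  # exact-match scoring
        total[dim] += 1
    return {d: correct[d] / total[d] for d in total}
```

Grouping accuracy by dimension, rather than reporting a single aggregate score, is what lets a benchmark like this localize failures to specific capabilities such as spatial relations or scale handling.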