🤖 AI Summary
Existing vision-language models (VLMs) are evaluated almost exclusively on frontal-view images; systematic assessment of top-down view understanding is largely absent, mainly because high-quality, diverse top-down image datasets are scarce. Method: We introduce TDBench, the first comprehensive benchmark for top-down image understanding, combining real-world datasets (AID, DOTA) with synthetic scenes rendered in Unreal Engine. It defines ten evaluation dimensions of image understanding and four realistic application scenarios (e.g., navigation, aerial imagery analysis), and establishes the first multi-dimensional evaluation framework built on structured visual question answering with standardized zero-shot and fine-tuning protocols. Contribution/Results: A comprehensive evaluation of mainstream VLMs reveals an average 23.6% accuracy drop on top-down tasks relative to frontal-view counterparts, exposing critical bottlenecks in spatial relation modeling, scale invariance, and contextual aggregation.
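To make the headline number concrete, here is a minimal sketch of how such an average frontal-view vs. top-down accuracy drop could be computed across models. All model names and scores below are made-up placeholders for illustration, not TDBench's published results.

```python
# Hedged sketch: averaging the frontal-view vs. top-down accuracy gap
# across models. All numbers are illustrative placeholders, not
# TDBench's published results.

def mean_accuracy_drop(frontal: dict[str, float],
                       topdown: dict[str, float]) -> float:
    """Average accuracy drop (absolute points) over models scored in both views."""
    models = frontal.keys() & topdown.keys()  # models evaluated in both settings
    return sum(frontal[m] - topdown[m] for m in models) / len(models)

# Made-up example scores (fraction of VQA questions answered correctly):
frontal = {"vlm_a": 0.81, "vlm_b": 0.74, "vlm_c": 0.69}
topdown = {"vlm_a": 0.58, "vlm_b": 0.51, "vlm_c": 0.44}
print(f"average drop: {mean_accuracy_drop(frontal, topdown):.1%}")
```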
📝 Abstract
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, covering diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies of scenarios that commonly arise in real-world applications but remain underexplored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
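To give a feel for the evaluation setup, below is a minimal sketch of a zero-shot evaluation loop over TDBench-style question-answer pairs, scored per evaluation dimension. The JSON record layout, the field names, and the `ask_vlm` stub are assumptions for illustration, not TDBench's actual schema or harness; see the project homepage for the real data format.

```python
# Minimal sketch of zero-shot, per-dimension VQA scoring. The record
# layout ("image", "question", "choices", "answer", "dimension") and
# the ask_vlm() stub are hypothetical, not TDBench's actual API.
import json
from collections import defaultdict

def ask_vlm(image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder: call your VLM and return one of `choices`."""
    raise NotImplementedError("plug in a real VLM client here")

def evaluate(benchmark_json: str) -> dict[str, float]:
    with open(benchmark_json) as f:
        samples = json.load(f)  # assumed: a flat list of VQA records

    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        dim = s["dimension"]  # one of the ten evaluation dimensions
        pred = ask_vlm(s["image"], s["question"], s["choices"])
        correct[dim] += int(pred == s["answer"])  # exact-match scoring
        total[dim] += 1
    return {d: correct[d] / total[d] for d in total}
```

Grouping accuracy by dimension, rather than reporting a single aggregate score, is what lets a benchmark like this localize failures to specific capabilities such as spatial relations or scale handling.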