🤖 AI Summary
Current vision-language models (VLMs) lack systematic evaluation of multimodal mathematical reasoning—such as geometric computation, trajectory estimation, and spatial analysis—in drone remote sensing. Method: We introduce AVI-Math, the first drone-image-specific mathematical reasoning benchmark, comprising 3,773 high-quality, vehicle-centric questions spanning six domains: geometry, algebra, logic, trigonometry, calculus, and statistics. Unlike conventional counting benchmarks, AVI-Math features complex, real-world problems captured from multiple altitudes and viewpoints. We propose a Chain-of-Thought prompting and fine-tuning framework for VLM enhancement and conduct comprehensive evaluations across 14 state-of-the-art VLMs. Contribution/Results: Experiments reveal critical deficiencies in domain-knowledge integration and spatial reasoning; our methods significantly improve performance. AVI-Math establishes a novel, rigorous foundation for evaluating and advancing mathematical understanding in trustworthy drone vision systems.
📝 Abstract
Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV)-based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math
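The Chain-of-Thought prompting explored in the abstract can be illustrated with a minimal sketch. The question text, subject label, and function name below are illustrative assumptions for an AVI-Math-style query, not items drawn from the released dataset or its official evaluation code:

```python
# Minimal sketch of Chain-of-Thought (CoT) prompting for a UAV-imagery
# math question. Everything here (question wording, subject tag, helper
# name) is a hypothetical example, not the paper's actual pipeline.

def build_cot_prompt(question: str, subject: str) -> str:
    """Wrap a vehicle-related math question with a step-by-step instruction."""
    return (
        f"Subject: {subject}\n"
        f"Question: {question}\n"
        "Let's think step by step, stating each geometric or algebraic "
        "assumption before giving the final answer."
    )

prompt = build_cot_prompt(
    "Given a UAV altitude of 50 m and a camera tilt of 30 degrees, "
    "estimate the ground distance between the two marked vehicles.",
    "geometry",
)
print(prompt)
```

In practice such a prompt would accompany the aerial image when querying a VLM; the paper's fine-tuning variant trains models on reasoning traces rather than relying on the instruction alone.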