🤖 AI Summary
Existing remote sensing visual grounding datasets rely predominantly on explicit referring expressions, which are ill-suited for implicit localization tasks that demand domain-specific knowledge. To close this gap, this work presents DVGBench, the first benchmark for implicit visual grounding in drone imagery, spanning six real-world application scenarios and providing paired explicit–implicit queries. The authors further propose DroneVG-R1, a model that integrates an Implicit-to-Explicit Chain-of-Thought (I2E-CoT) mechanism within a reinforcement learning framework, translating implicit references into actionable explicit expressions before localization. Experiments on DVGBench show that prevailing models struggle with implicit reasoning, whereas DroneVG-R1 achieves markedly higher grounding accuracy.
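The core idea of I2E-CoT, as described above, is to reason an implicit query into an explicit referring expression first and only then ground it. A minimal sketch of that two-stage flow is below; the function names, the toy rewrite table, and the fixed bounding box are all illustrative assumptions, not the authors' actual implementation (which learns the rewriting step with reinforcement learning inside a vision-language model).

```python
# Hedged sketch of a two-stage implicit-to-explicit grounding pipeline.
# Every name here is hypothetical; real stages would call a trained model.

def implicit_to_explicit(query: str) -> str:
    """Stage 1 (assumed): rewrite an implicit, knowledge-dependent query
    into an explicit referring expression. A toy lookup stands in for the
    learned chain-of-thought step."""
    rewrites = {
        "find a safe spot for an emergency landing":
            "the flat, empty paved area away from people and vehicles",
    }
    return rewrites.get(query, query)

def ground(explicit_expr: str) -> tuple[int, int, int, int]:
    """Stage 2 (assumed): localize the explicit expression to a bounding
    box. Stubbed with a fixed (x1, y1, x2, y2) box for illustration; a
    real grounding model would predict coordinates from the image."""
    return (120, 340, 480, 620)

def drone_vg_pipeline(query: str) -> tuple[str, tuple[int, int, int, int]]:
    expr = implicit_to_explicit(query)  # reason first ...
    box = ground(expr)                  # ... then ground
    return expr, box

expr, box = drone_vg_pipeline("find a safe spot for an emergency landing")
```

Under this sketch, the intermediate explicit expression is kept alongside the box, mirroring the summary's point that implicit references are translated into actionable explicit ones before localization.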