AI Summary
This work addresses the absence of a unified and reproducible evaluation benchmark for vision-language models (VLMs) and autonomous agents in real-world aerial road damage localization. We introduce the first road damage localization benchmark based on professionally annotated drone imagery, uniquely integrating the direct visual grounding capabilities of VLMs with the engineering-oriented reasoning of large language model-driven autonomous agents within the same realistic setting. Fair assessment is enabled through a dual-track protocol, a standardized prompt-parsing pipeline, the AP_50 metric, and a feedback oracle mechanism. Experiments reveal that state-of-the-art closed-source VLMs lead in performance yet still exhibit room for improvement, while open-source models fail to localize small damages effectively. Although most agents demonstrate interactive capabilities, they struggle to produce valid submissions under resource constraints.
Abstract
We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.
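For readers unfamiliar with the metric, the following is a minimal sketch of how a per-class AP_50 score is commonly computed: predictions for one class are ranked by confidence, each is matched to a ground-truth box at IoU >= 0.5, and the area under the resulting precision-recall curve is taken. The abstract does not specify the benchmark's matching or interpolation rules, so this sketch assumes Pascal VOC-style all-point interpolation. The names `iou` and `ap50` and the input layout are hypothetical and are not taken from the released code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap50(
    preds: List[Tuple[str, float, Box]],
    gts: Dict[str, List[Box]],
    thr: float = 0.5,
) -> float:
    """AP at IoU >= thr for a single class (assumed VOC-style protocol).

    preds: (image_id, score, box) triples for this class.
    gts:   image_id -> ground-truth boxes for this class.
    """
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0

    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    tp_flags: List[bool] = []

    # Visit predictions from most to least confident.
    for img, _score, box in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gts.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        # True positive only if the best-overlapping box is still unmatched.
        if best_j >= 0 and best_iou >= thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)

    # Running precision and recall after each ranked prediction.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)
        recalls.append(tp / n_gt)

    # Make precision non-increasing in recall (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A benchmark-level score would then typically average `ap50` over damage classes. Whether WildRoadBench reports such a mean or only the per-class values is not stated in the abstract.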