ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of a unified evaluation benchmark for embodied drones driven by multimodal large language models (MLLMs) in search-and-rescue scenarios. To bridge this gap, the authors introduce the Embodied Search and Rescue (ESAR) task and present the first high-fidelity, open-ended benchmark for it, built with Unreal Engine 5 and AirSim on real-world geographic data to simulate complex, dynamic environments. The benchmark incorporates realistic mission scenarios and multidimensional evaluation metrics. Experimental results reveal significant limitations in current approaches, particularly in spatial memory retention, aerial adaptability, and the trade-off between search efficiency and flight safety, showing that the benchmark is both discriminative and challenging.

📝 Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive, unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations in order to make informed decisions. We then present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, our benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the challenges of ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource for advancing research in the Embodied Search and Rescue domain. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.

Problem

Research questions and friction points this paper is trying to address.

Embodied Search and Rescue

UAV SAR

benchmark

Multimodal Large Language Models

aerial agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Search and Rescue

Multimodal Large Language Models

UAV Benchmark

Photorealistic Simulation

Aerial Agent Evaluation

🔎 Similar Papers

AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models

2024-08-28 · arXiv.org · Citations: 10

CloudTrack: Scalable UAV Tracking with Cloud Semantics

2024-09-24 · arXiv.org · Citations: 0