🤖 AI Summary
Existing benchmarks inadequately evaluate the capabilities of multimodal large language models (MLLMs) in embodied, collaborative perception and reasoning under complex, degraded visual conditions, particularly in multi-drone settings.
Method: We introduce AirCopBench, the first comprehensive benchmark for embodied collaborative perception, comprising over 14,600 multi-view question-answer samples drawn from both simulated and real-world scenarios. It covers four task dimensions: scene understanding, object understanding, perception assessment, and collaborative decision-making. Crucially, it provides the first annotations of multi-agent collaborative events under adverse conditions, enabling evaluation of cross-task consistency and collaborative reasoning. Data construction follows a simulation-to-reality hybrid strategy, generating questions through model-, rule-, and human-based methods under rigorous quality control.
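To make the benchmark's composition concrete, here is a minimal sketch of how one multi-view QA sample could be represented. All field names, task-type names, and values are illustrative assumptions, not the released AirCopBench format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AirCopSample:
    """One multi-view QA sample (illustrative schema, not the released format)."""
    question_id: str
    task_category: str   # e.g. "scene_understanding", "object_understanding",
                         # "perception_assessment", "collaborative_decision"
    task_type: str       # one of the 14 fine-grained task types (name assumed)
    views: List[str]     # image paths, one per drone in the team
    degradation: str     # annotated adverse condition, e.g. "fog", "low_light"
    source: str          # "simulator" or "real_world"
    question: str
    choices: List[str]   # multiple-choice options
    answer: str          # ground-truth choice label

# Hypothetical example instance; contents are made up for illustration.
sample = AirCopSample(
    question_id="sim_000001",
    task_category="collaborative_decision",
    task_type="view_selection",
    views=["drone_a/frame_042.png", "drone_b/frame_042.png"],
    degradation="fog",
    source="simulator",
    question="Which drone currently has the clearer view of the target vehicle?",
    choices=["A. drone_a", "B. drone_b"],
    answer="B",
)
```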
Contribution/Results: Evaluation across 40 MLLMs reveals that even the best model trails humans by 24.38% on average and exhibits marked inconsistency across tasks. Fine-tuning experiments confirm effective simulation-to-reality transfer, validating AirCopBench's utility for advancing robust, collaborative multimodal intelligence.
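To illustrate how the headline numbers can be read, the sketch below averages per-task accuracies and compares them against human performance; a simple accuracy spread stands in for a task-consistency measure. All figures and the consistency proxy are placeholders, not the paper's metric definitions or results:

```python
# Placeholder per-task accuracies (NOT AirCopBench results).
model_acc = {"scene": 0.71, "object": 0.58, "assessment": 0.44, "decision": 0.39}
human_acc = {"scene": 0.90, "object": 0.85, "assessment": 0.72, "decision": 0.78}

# Average over task dimensions, then take the human-model difference.
# The paper reports an average gap of 24.38%; these placeholders give
# a different value and are only meant to show the shape of the computation.
model_avg = sum(model_acc.values()) / len(model_acc)
human_avg = sum(human_acc.values()) / len(human_acc)
gap = (human_avg - model_avg) * 100

# One crude proxy for task-wise inconsistency: spread of per-task accuracy.
spread = max(model_acc.values()) - min(model_acc.values())
print(f"avg human-model gap: {gap:.2f} pts, per-task accuracy spread: {spread:.2f}")
```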
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
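The sim-to-real finding corresponds to a protocol along these lines: fine-tune on simulator-derived QA only, then re-evaluate on the real-world split. Everything below (loader, trainer, evaluator, and numbers) is a hypothetical stand-in for a real training and evaluation stack, not the paper's setup:

```python
def load_split(source: str, split: str) -> list[dict]:
    """Hypothetical loader returning placeholder QA records."""
    return [{"source": source, "split": split, "idx": i} for i in range(4)]

def evaluate(skill: float, data: list[dict]) -> float:
    """Hypothetical evaluator: accuracy as a simple function of model skill."""
    return min(1.0, 0.40 + 0.30 * skill)

def finetune(skill: float, data: list[dict]) -> float:
    """Hypothetical fine-tuning step that improves model skill."""
    return skill + 0.5

sim_train = load_split("simulator", "train")
real_test = load_split("real_world", "test")

skill = 0.0
base_acc = evaluate(skill, real_test)    # zero-shot on real-world data
skill = finetune(skill, sim_train)       # train on simulator data only
tuned_acc = evaluate(skill, real_test)   # re-test on real-world data

# Transfer holds if real-world accuracy improves after sim-only fine-tuning.
print(f"real-world accuracy: {base_acc:.2f} -> {tuned_acc:.2f}")
```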