AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately evaluate multimodal large language models' (MLLMs) capabilities in embodied, collaborative perception and reasoning under complex, degraded visual conditions, particularly in multi-drone settings. Method: We introduce AirCopBench, the first comprehensive benchmark for embodied aerial collaborative perception, comprising over 14,600 multi-view question-answer samples drawn from both simulated and real-world scenarios. It spans four task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. Crucially, it annotates collaborative events under adverse perceptual conditions, enabling evaluation of cross-task consistency and collaborative reasoning. Data construction follows a simulation-to-reality hybrid strategy, generating questions through model-, rule-, and human-based methods under rigorous quality control. Contribution/Results: Evaluation across 40 MLLMs reveals that the best model still trails humans by 24.38% on average and exhibits inconsistent performance across tasks. Fine-tuning experiments confirm effective simulation-to-reality transfer, validating AirCopBench's utility for advancing robust collaborative multimodal intelligence.
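
To make the sample format concrete, here is a minimal sketch of what one multi-view QA item might look like. The field names and types are assumptions for illustration only, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class CollabQASample:
    """One multi-view question-answer item (hypothetical schema)."""
    sample_id: str
    images: list[str]               # one egocentric image path per drone view
    source: str                     # "sim" or "real"
    category: str                   # e.g. "Scene Understanding", "Collaborative Decision"
    task_type: str                  # one of the 14 task types
    question: str
    choices: list[str]              # multiple-choice options
    answer: str                     # ground-truth option label
    degradation: str | None = None  # annotated adverse condition, e.g. "fog" or "low-light"
```

Keeping one image per drone view in a single sample is what distinguishes this setup from single-agent multi-image benchmarks: a model must fuse several egocentric perspectives before answering.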

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
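
As a rough illustration of the reported evaluation protocol, the sketch below computes per-task accuracy, its average, and the gap to a human baseline. The `model.answer` interface and the sample iterator are hypothetical placeholders, not AirCopBench's released code.

```python
from collections import defaultdict

def evaluate(model, samples, human_avg=None):
    """Per-task accuracy, its average, and the gap to a human baseline."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        # `model.answer(...)` is an assumed interface returning an option label.
        pred = model.answer(s.images, s.question, s.choices)
        totals[s.task_type] += 1
        hits[s.task_type] += int(pred == s.answer)
    per_task = {t: hits[t] / totals[t] for t in totals}
    avg = sum(per_task.values()) / len(per_task)
    # The paper reports the best model trailing humans by 24.38 points on average.
    gap = human_avg - avg if human_avg is not None else None
    return per_task, avg, gap
```

Reporting accuracy per task type, rather than a single pooled number, is what exposes the cross-task inconsistency the paper highlights.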
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal models in multi-drone collaborative perception scenarios
Addressing performance gaps in embodied aerial perception under degraded conditions
Benchmarking collaborative reasoning across simulator and real-world drone data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-drone collaborative perception benchmark
Simulator and real-world degraded perception data
Model-, rule-, and human-based question generation under rigorous quality control (see the sketch after this list)
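
A minimal sketch of what the rule-based arm of that pipeline could look like: turning one annotated collaborative event into a multiple-choice item via a template. The event record, template, and helper are hypothetical; the paper also uses model- and human-based generation.

```python
import random

# Hypothetical template; the benchmark's actual rules are not reproduced here.
TEMPLATE = "Which drone should take over tracking of {target} when {event}?"

def rule_based_question(event, drone_ids, rng=random):
    """Turn one annotated collaborative event into a multiple-choice item."""
    question = TEMPLATE.format(target=event["target"], event=event["description"])
    correct = event["best_drone"]
    distractors = [d for d in drone_ids if d != correct]
    choices = rng.sample(distractors, 3) + [correct]  # assumes at least 4 drones
    rng.shuffle(choices)
    return {"question": question, "choices": choices, "answer": correct}
```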
👥 Authors
Jirong Zha
Shenzhen International Graduate School, Tsinghua University
Yuxuan Fan
Peking University
Tianyu Zhang
School of Electrical and Electronic Engineering, Nanyang Technological University
Geng Chen
College of Software, Jilin University
Yingfeng Chen
Weiyang College, Tsinghua University
Chen Gao
BNRist, Tsinghua University
Xinlei Chen
Shenzhen International Graduate School, Tsinghua University