🤖 AI Summary
Existing benchmarks inadequately evaluate the capabilities of multimodal large language models (MLLMs) in embodied, collaborative perception and reasoning under complex, degraded visual conditions, particularly in multi-drone settings.
Method: We introduce AirCopBench, the first comprehensive benchmark for embodied collaborative perception, comprising over 14,600 multi-view question-answer samples drawn from both simulated and real-world scenarios. It covers four task dimensions: scene understanding, object understanding, perception assessment, and collaborative decision-making. Crucially, it provides the first annotations of multi-agent collaborative events under adverse conditions, enabling evaluation of cross-task consistency and collaborative reasoning. Data construction follows a simulation-to-reality hybrid strategy, generating questions through model-, rule-, and human-based methods under rigorous quality control.
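To make the benchmark's composition concrete, here is a minimal sketch of how one multi-view QA sample could be represented. All field names, task-type names, and values are illustrative assumptions, not the released AirCopBench format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AirCopSample:
    """One multi-view QA sample (illustrative schema, not the released format)."""
    question_id: str
    task_category: str   # e.g. "scene_understanding", "object_understanding",
                         # "perception_assessment", "collaborative_decision"
    task_type: str       # one of the 14 fine-grained task types (name assumed)
    views: List[str]     # image paths, one per drone in the team
    degradation: str     # annotated adverse condition, e.g. "fog", "low_light"
    source: str          # "simulator" or "real_world"
    question: str
    choices: List[str]   # multiple-choice options
    answer: str          # ground-truth choice label

# Hypothetical example instance; contents are made up for illustration.
sample = AirCopSample(
    question_id="sim_000001",
    task_category="collaborative_decision",
    task_type="view_selection",
    views=["drone_a/frame_042.png", "drone_b/frame_042.png"],
    degradation="fog",
    source="simulator",
    question="Which drone currently has the clearer view of the target vehicle?",
    choices=["A. drone_a", "B. drone_b"],
    answer="B",
)
```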
Contribution/Results: Evaluation across 40 MLLMs reveals that even the best model trails humans by 24.38% on average and exhibits marked inconsistency across tasks. Fine-tuning experiments confirm effective simulation-to-reality transfer, validating AirCopBench's utility for advancing robust, collaborative multimodal intelligence.
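To illustrate how the headline numbers can be read, the sketch below averages per-task accuracies and compares them against human performance; a simple accuracy spread stands in for a task-consistency measure. All figures and the consistency proxy are placeholders, not the paper's metric definitions or results:

```python
# Placeholder per-task accuracies (NOT AirCopBench results).
model_acc = {"scene": 0.71, "object": 0.58, "assessment": 0.44, "decision": 0.39}
human_acc = {"scene": 0.90, "object": 0.85, "assessment": 0.72, "decision": 0.78}

# Average over task dimensions, then take the human-model difference.
# The paper reports an average gap of 24.38%; these placeholders give
# a different value and are only meant to show the shape of the computation.
model_avg = sum(model_acc.values()) / len(model_acc)
human_avg = sum(human_acc.values()) / len(human_acc)
gap = (human_avg - model_avg) * 100

# One crude proxy for task-wise inconsistency: spread of per-task accuracy.
spread = max(model_acc.values()) - min(model_acc.values())
print(f"avg human-model gap: {gap:.2f} pts, per-task accuracy spread: {spread:.2f}")
```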
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
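The sim-to-real finding corresponds to a protocol along these lines: fine-tune on simulator-derived QA only, then re-evaluate on the real-world split. Everything below (loader, trainer, evaluator, and numbers) is a hypothetical stand-in for a real training and evaluation stack, not the paper's setup:

```python
def load_split(source: str, split: str) -> list[dict]:
    """Hypothetical loader returning placeholder QA records."""
    return [{"source": source, "split": split, "idx": i} for i in range(4)]

def evaluate(skill: float, data: list[dict]) -> float:
    """Hypothetical evaluator: accuracy as a simple function of model skill."""
    return min(1.0, 0.40 + 0.30 * skill)

def finetune(skill: float, data: list[dict]) -> float:
    """Hypothetical fine-tuning step that improves model skill."""
    return skill + 0.5

sim_train = load_split("simulator", "train")
real_test = load_split("real_world", "test")

skill = 0.0
base_acc = evaluate(skill, real_test)    # zero-shot on real-world data
skill = finetune(skill, sim_train)       # train on simulator data only
tuned_acc = evaluate(skill, real_test)   # re-test on real-world data

# Transfer holds if real-world accuracy improves after sim-only fine-tuning.
print(f"real-world accuracy: {base_acc:.2f} -> {tuned_acc:.2f}")
```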