🤖 AI Summary
This work addresses the communication latency and inefficient resource utilization caused by massive visual data transmission in multi-UAV collaborative perception. To this end, we propose a base-station-assisted collaborative perception framework (BHU) that incorporates a Top-K pixel selection mechanism to sparsify images, leverages MU-MIMO wireless transmission, and employs a Swin-Large-based MaskDINO encoder on the ground server for bird's-eye-view feature extraction and fusion. As a key innovation, we integrate large vision models with a diffusion-model-enhanced deep reinforcement learning approach to jointly optimize UAV collaboration selection, the sparsification ratio, and the precoding matrices. Evaluated on the Air-Co-Pred dataset, our method improves perception performance by over 5% compared to CNN-based baselines while reducing communication overhead by 85%.
📝 Abstract
Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user multiple-input multiple-output (MU-MIMO) communications, where a Swin-Large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm that jointly determines the cooperative UAV selection, sparsification ratios, and precoding matrices, striking a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
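The Top-K sparsification step described above can be illustrated with a minimal sketch. The paper does not specify how pixel informativeness is scored, so local gradient magnitude is used here purely as a placeholder saliency measure; the function name `topk_sparsify` and the `keep_ratio` parameter are illustrative assumptions, not the paper's API.

```python
import numpy as np

def topk_sparsify(image: np.ndarray, keep_ratio: float = 0.15):
    """Keep only the top-K most 'informative' pixels of a 2D image.

    Informativeness is approximated here by local gradient magnitude
    (an assumption for illustration; the BHU framework may use a
    learned or task-driven scoring function instead).
    """
    gy, gx = np.gradient(image.astype(np.float64))
    saliency = np.hypot(gx, gy)                      # per-pixel score
    k = max(1, int(keep_ratio * image.size))         # number of pixels kept
    # Threshold at the k-th largest saliency value.
    thresh = np.partition(saliency.ravel(), -k)[-k]
    mask = saliency >= thresh
    # Zero out all non-selected pixels; only (value, index) pairs of the
    # masked pixels would actually be transmitted over the MU-MIMO link.
    sparse = np.where(mask, image, 0)
    return sparse, mask
```

With `keep_ratio = 0.15`, only about 15% of pixel values need to be transmitted, which is the mechanism behind the reported reduction in communication overhead; the trade-off between `keep_ratio` and perception utility is what the DRL algorithm tunes.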