CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive communication overhead caused by transmitting redundant BEV feature maps in multi-agent collaborative perception, this paper proposes a lightweight query-based collaboration framework. The authors introduce an object-query-driven collaboration paradigm, designing an Efficient Query Transformer (EQFormer) and a synergistic inter-stage deep supervision mechanism that transmits only semantically critical object-query features, enabling joint optimization of communication and perception. The method integrates cross-modal Transformers, object-query encoding, BEV feature sparsification, and multi-stage supervision. On V2V4Real, it reduces communication bandwidth to 0.416 Mb, just 1/83 of the state-of-the-art, while improving AP₇₀ by 1.1%; on OPV2V, it consistently outperforms existing methods. The framework significantly improves practicality under bandwidth constraints and achieves a superior accuracy-efficiency trade-off.

📝 Abstract
Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on the OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb of bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.
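As a rough sanity check on the reported figure, a back-of-envelope estimate of the Top-50 query payload lands in the same range. The embedding dimension and float32 precision below are assumptions for illustration; the summary only reports the final 0.416 Mb number:

```python
# Back-of-envelope payload estimate for transmitting Top-50 object queries.
# D = 256 and float32 precision are ASSUMED values, not stated in the summary.
K = 50             # Top-50 object queries transmitted per agent
D = 256            # assumed query embedding dimension
BYTES_PER_VAL = 4  # float32
payload_mbit = K * D * BYTES_PER_VAL * 8 / 1e6  # bytes -> bits -> megabits
```

With these assumed dimensions the payload is about 0.41 Mb, the same order as the reported 0.416 Mb; the small gap would be absorbed by per-query metadata such as positions or confidence scores.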
Problem

Research questions and friction points this paper is trying to address.

Reduces communication bandwidth in multi-agent perception systems.
Selectively transmits essential features to enhance efficiency.
Improves detection accuracy while minimizing data transmission.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-query-based collaboration framework reduces bandwidth.
Efficient Query Transformer fuses multi-agent object queries.
Synergistic deep supervision strengthens positive reinforcement between stages.
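The selective-transmission idea behind these bullets can be sketched as a simple top-k filter over candidate object queries. This is a minimal illustration, not the paper's implementation; the scoring inputs, array shapes, and k=50 default are assumptions:

```python
import numpy as np

def select_top_k_queries(queries, scores, k=50):
    """Keep only the k highest-confidence object queries for transmission.

    queries: (N, D) array of query embeddings
    scores:  (N,)   per-query confidence scores
    Returns the selected queries and their indices in the original array.
    """
    idx = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return queries[idx], idx

# Toy example: 200 candidate queries, 256-dim each (dimensions assumed)
rng = np.random.default_rng(0)
q = rng.standard_normal((200, 256)).astype(np.float32)
s = rng.random(200)
kept, idx = select_top_k_queries(q, s, k=50)
```

Transmitting only `kept` instead of a dense BEV feature map is what drives the bandwidth reduction described above; the fusion of the received queries is then handled by EQFormer.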