CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

📅 2025-08-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent collaborative perception, sequential fusion across agents and temporal steps leads to suboptimal efficiency and accuracy. To address this, we propose CoST, a unified spatiotemporal collaborative perception framework. Its core innovation lies in jointly modeling cross-agent and cross-temporal feature fusion within a single shared spatiotemporal latent space, enabling one-shot feature transmission and end-to-end jointly optimized aggregation. Built upon a spatiotemporal Transformer architecture, CoST supports end-to-end training and is compatible with mainstream collaborative perception methods. Experiments demonstrate that CoST achieves state-of-the-art (SOTA) perception accuracy while reducing communication bandwidth by 37–52% and inference latency by 28–41%, significantly enhancing both efficiency and robustness in complex, dynamic environments.

Technology Category

Application Category

📝 Abstract
Collaborative perception shares information among different agents and helps solving problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultanesouly. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatial temporal space, and thus only requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatial-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.
Problem

Research questions and friction points this paper is trying to address.

Unifies multi-agent and multi-time fusion for efficient perception
Reduces redundant feature transmission in collaborative perception
Enhances perception accuracy in occlusions and small sensing ranges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified spatio-temporal space aggregation
Efficient single transmission per object
Compatible with prior collaborative methods
🔎 Similar Papers
No similar papers found.