🤖 AI Summary
In multi-agent collaborative perception, sequential fusion across agents and temporal steps leads to suboptimal efficiency and accuracy. To address this, we propose CoST, a unified spatiotemporal collaborative perception framework. Its core innovation lies in jointly modeling cross-agent and cross-temporal feature fusion within a single shared spatiotemporal latent space, enabling one-shot feature transmission and end-to-end jointly optimized aggregation. Built upon a spatiotemporal Transformer architecture, CoST supports end-to-end training and is compatible with mainstream collaborative perception methods. Experiments demonstrate that CoST achieves state-of-the-art (SOTA) perception accuracy while reducing communication bandwidth by 37–52% and inference latency by 28–41%, significantly enhancing both efficiency and robustness in complex, dynamic environments.
📝 Abstract
Collaborative perception shares information among agents and helps solve problems that an individual agent may face, e.g., occlusion and limited sensing range. Prior methods usually separate multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception method that simultaneously aggregates the observations from different agents (space) and different times into a unified spatio-temporal space. The unified spatio-temporal space brings two benefits: efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space and thus requires transmission only once (whereas prior methods re-transmit all object features multiple times). 2) Superior feature fusion: merging multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) improves both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.
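The transmission argument can be illustrated with a toy count of messages. The sketch below is not CoST's implementation: the `Observation` record, the `object_id` and `is_static` fields, and both counting functions are hypothetical, and it counts messages rather than measuring bandwidth. It only shows why sending a static object once, instead of at every frame, reduces the number of transmitted features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    object_id: int   # hypothetical identifier for an observed object
    frame: int       # time step of the observation
    is_static: bool  # assumed to be known; a real system would have to infer it


def per_frame_payload(observations):
    """Baseline: every observation is transmitted at every frame."""
    return len(observations)


def unified_payload(observations):
    """Unified space: a static object is transmitted once, a moving one per frame."""
    sent_static = set()
    count = 0
    for obs in observations:
        if obs.is_static:
            if obs.object_id in sent_static:
                continue  # already present in the shared space, skip re-sending
            sent_static.add(obs.object_id)
        count += 1
    return count


# Toy scene: object 0 is static, object 1 moves; both are seen in 3 frames.
observations = [Observation(oid, t, oid == 0) for t in range(3) for oid in (0, 1)]
baseline = per_frame_payload(observations)
unified = unified_payload(observations)
print(baseline, unified)
```

In this toy scene the baseline sends six messages while the unified scheme sends four, and the gap widens as the number of frames or the share of static objects grows.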