SparseCoop: Cooperative Perception with Kinematic-Grounded Queries

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in cooperative perception, including high communication overhead, inflexible spatiotemporal alignment across asynchronous vehicles and disparate viewpoints, weak geometric modeling in sparse queries, suboptimal fusion strategies, and training instability, this paper proposes SparseCoop, a fully sparse, BEV-free cooperative perception framework. The method introduces three core components: (1) kinematic-grounded instance queries whose explicit state vector (3D geometry and velocity) supports explicit, interpretable spatiotemporal alignment; (2) a coarse-to-fine sparse aggregation module for robust cross-vehicle feature fusion; and (3) a cooperative instance denoising task that accelerates and stabilizes training. On V2X-Seq and Griffin, the approach achieves state-of-the-art performance in 3D detection and tracking while transmitting only sparse queries, which keeps the communication load low. It also reports superior computational efficiency and strong robustness to communication latency.

📝 Abstract
Cooperative perception is critical for autonomous driving, overcoming the inherent limitations of a single vehicle, such as occlusions and constrained fields of view. However, current approaches that share dense Bird's-Eye-View (BEV) features are constrained by quadratically scaling communication costs and lack the flexibility and interpretability needed for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework features a trio of innovations: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module for robust fusion; and a cooperative instance denoising task to accelerate and stabilize training. Experiments on the V2X-Seq and Griffin datasets show that SparseCoop achieves state-of-the-art performance. Notably, it does so with superior computational efficiency, low transmission cost, and strong robustness to communication latency. Code is available at https://github.com/wang-jh18-SVM/SparseCoop.
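To make the idea of an explicit state vector concrete, here is a minimal sketch of how a received instance could be aligned to the ego vehicle. It is not the SparseCoop implementation: the field layout of `InstanceState`, the constant-velocity motion model, and the planar (x, y, yaw) rigid transform are all assumptions chosen for illustration, as are the function names.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InstanceState:
    """Hypothetical state layout: 3D center, box size, heading, planar velocity."""
    x: float
    y: float
    z: float
    l: float
    w: float
    h: float
    yaw: float
    vx: float
    vy: float


def compensate_latency(s: InstanceState, dt: float) -> InstanceState:
    """Temporal alignment: extrapolate the center by dt seconds (constant velocity assumed)."""
    return replace(s, x=s.x + s.vx * dt, y=s.y + s.vy * dt)


def to_ego_frame(s: InstanceState, tx: float, ty: float, theta: float) -> InstanceState:
    """Spatial alignment: rotate by theta and translate by (tx, ty), sender frame to ego frame."""
    c, sn = math.cos(theta), math.sin(theta)
    return replace(
        s,
        x=c * s.x - sn * s.y + tx,
        y=sn * s.x + c * s.y + ty,
        # Wrap the heading into [-pi, pi).
        yaw=(s.yaw + theta + math.pi) % (2 * math.pi) - math.pi,
        # Velocity is a free vector: rotate it, do not translate it.
        vx=c * s.vx - sn * s.vy,
        vy=sn * s.vx + c * s.vy,
    )


def align(s: InstanceState, dt: float, tx: float, ty: float, theta: float) -> InstanceState:
    """Compensate the transmission delay first, then change coordinate frames."""
    return to_ego_frame(compensate_latency(s, dt), tx, ty, theta)
```

Because the state is explicit, both steps are closed-form operations on a handful of numbers, which is one way to read the abstract's claim that alignment becomes precise and interpretable.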

Problem

Research questions and friction points this paper is trying to address.

Overcomes single vehicle limitations like occlusions and constrained fields-of-view
Addresses high communication costs and alignment issues in cooperative perception
Improves geometric representation and training stability in sparse query methods
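The communication-cost point can be illustrated with a rough payload estimate. This is a back-of-the-envelope sketch, not a measurement from the paper: the perception range, grid resolution, channel count, query count, and feature width below are hypothetical values, and the function names are invented for this example. It reads "quadratically scaling" as growth with the side length of a square BEV grid.

```python
def bev_payload_bytes(range_m: float, resolution_m: float, channels: int,
                      bytes_per_value: int = 4) -> int:
    """Size of a dense square BEV feature map covering [-range_m, range_m] on both axes."""
    cells_per_side = round(2 * range_m / resolution_m)
    return cells_per_side * cells_per_side * channels * bytes_per_value


def query_payload_bytes(num_queries: int, feature_dim: int, state_dim: int,
                        bytes_per_value: int = 4) -> int:
    """Size of a set of sparse instance queries, each with a feature vector and a state vector."""
    return num_queries * (feature_dim + state_dim) * bytes_per_value


# Hypothetical settings: 50 m range, 0.5 m cells, 64 channels, float32 values.
dense = bev_payload_bytes(50, 0.5, 64)
# Doubling the range quadruples the dense payload.
dense_far = bev_payload_bytes(100, 0.5, 64)
# Hypothetical settings: 100 queries, 256-d features, 9-d state.
sparse = query_payload_bytes(100, 256, 9)
print(dense, dense_far, sparse)
```

Under these assumed numbers the sparse payload does not depend on the perception range at all, only on the number of transmitted instances.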

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kinematic-grounded instance query for precise spatio-temporal alignment
Coarse-to-fine aggregation module for robust fusion
Cooperative instance denoising task to accelerate training
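As an illustration of the denoising idea, the sketch below builds noisy queries from ground-truth boxes, in the general style of denoising training for query-based detectors. The page does not describe SparseCoop's actual recipe, so the box layout `(x, y, z, l, w, h, yaw)`, the uniform noise model, the noise magnitudes, and the function name are all assumptions.

```python
import random


def make_denoising_queries(gt_boxes, center_noise=0.5, scale_noise=0.1,
                           groups=2, seed=0):
    """Return (noisy_box, target_box) pairs; the model would be trained to recover the target.

    gt_boxes: list of (x, y, z, l, w, h, yaw) tuples (hypothetical layout).
    center_noise: maximum absolute center shift in meters.
    scale_noise: maximum relative change of each box dimension.
    groups: number of independently perturbed copies of every box.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(groups):
        for box in gt_boxes:
            x, y, z, l, w, h, yaw = box
            noisy = (
                x + rng.uniform(-center_noise, center_noise),
                y + rng.uniform(-center_noise, center_noise),
                z + rng.uniform(-center_noise, center_noise),
                l * (1 + rng.uniform(-scale_noise, scale_noise)),
                w * (1 + rng.uniform(-scale_noise, scale_noise)),
                h * (1 + rng.uniform(-scale_noise, scale_noise)),
                yaw,  # heading left unperturbed in this sketch
            )
            pairs.append((noisy, box))
    return pairs
```

Since each noisy query has a known target, no matching step is needed for these queries during training, which is the usual argument for why denoising tasks speed up and stabilize convergence.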
Jiahao Wang
Tsinghua University
Zhongwei Jiang
Nanyang Technological University
Wenchao Sun
Tsinghua University
Jiaru Zhong
The Hong Kong Polytechnic University
Haibao Yu
The University of Hong Kong
Yuner Zhang
University of Pennsylvania
Chenyang Lu
Tsinghua University
Chuang Zhang
Tsinghua University
Autonomous Driving, Intelligent Connected Vehicle
Lei He
Tsinghua University
Shaobing Xu
Tsinghua University
Jianqiang Wang
Associate Professor of Library and Information Studies, University at Buffalo
Information Retrieval, e-discovery