AI Summary
This work addresses the insufficient robustness of cooperative perception in vehicle-infrastructure systems under communication channel impairments such as noise, fading, and interference. To this end, the authors propose a Transformer-based adaptive feature fusion framework that models temporal correlations through multi-agent temporal aggregation, captures inter-agent and spatial dependencies via a dual-path spatial attention mechanism, and compensates for degraded information using an uncertainty-guided feature fusion strategy. Additionally, a teacher-student knowledge distillation scheme is employed to further enhance performance. Experimental results on the V2XSet and DAIR-V2X datasets demonstrate that the proposed method consistently outperforms existing approaches under both ideal and impaired communication conditions, achieving a superior balance between accuracy and robustness while maintaining computational efficiency.
Abstract
Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle-to-everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer-based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter-agent, and spatial correlations. AFFormer introduces three key modules: Multi-Agent and Temporal Aggregation for context-aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty-Guided Fusion for entropy-driven refinement of fused features. A teacher-student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early-collaboration supervision. AFFormer is validated on the V2XSet and DAIR-V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication-induced feature degradation while maintaining a competitive efficiency-accuracy trade-off.
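The abstract describes Uncertainty-Guided Fusion only as "entropy-driven refinement of fused features" and gives no formulation. As a rough illustration of the general idea, the sketch below weights each agent's feature vector at a single spatial cell by how low the entropy of its confidence score is, so that features that look corrupted by the channel contribute less. Everything in it is an assumption made for illustration: the function names, the use of binary (Bernoulli) entropy, the softmax weighting, and the `temperature` parameter are hypothetical and should not be read as AFFormer's actual module.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a Bernoulli variable with probability p."""
    # Clamp to avoid log(0) at p = 0 or p = 1.
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(features, confidences, temperature=1.0):
    """Fuse per-agent feature vectors for one spatial cell (illustrative only).

    features:    list of equal-length feature vectors, one per agent
    confidences: list of foreground probabilities in [0, 1], one per agent
    temperature: hypothetical knob; smaller values sharpen the weighting
    Returns (fused_vector, weights).
    """
    if not features or len(features) != len(confidences):
        raise ValueError("need one confidence per agent and at least one agent")

    # Lower entropy means a more certain agent, so it gets a higher score.
    scores = [-binary_entropy(p) / temperature for p in confidences]

    # Numerically stable softmax over the agents.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the agents' features, dimension by dimension.
    dim = len(features[0])
    fused = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights


# Agent 0 is confident (p = 0.99); agent 1 is maximally uncertain (p = 0.5).
fused, weights = entropy_weighted_fusion(
    features=[[1.0, 0.0], [0.0, 1.0]],
    confidences=[0.99, 0.5],
)
```

In this toy call the confident agent receives the larger weight, so the fused vector leans toward its features. A real system would apply such weighting per cell across a full bird's-eye-view feature map and would likely learn the mapping from uncertainty to weights.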