🤖 AI Summary
To address asynchronous multi-agent feature misalignment and semantic inconsistency in vehicle-to-vehicle (V2V) cooperative perception caused by communication latency, this paper proposes a feature-level trajectory modeling framework. It formulates motion compensation as a spatiotemporally continuous attention path, enabling temporally ordered sampling and semantic alignment of historical features for the current query. The core contributions include: (i) the first Transformer-based trajectory-aware attention mechanism; (ii) a differentiable temporal sampling module; and (iii) a cross-frame feature propagation and reconstruction network. This unified approach jointly resolves both spatial and semantic misalignment, significantly improving consistency and real-time performance in asynchronous feature fusion. Extensive experiments demonstrate state-of-the-art performance on the V2V4Real and DAIR-V2X-Seq benchmarks, establishing a new paradigm for asynchronous cooperative perception.
📝 Abstract
Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latency causes misalignment in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.
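The core idea of trajectory-guided sampling and attention can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the bilinear sampler, and the dot-product softmax scoring over trajectory samples are all simplifying assumptions; the paper's module is learned end-to-end over BEV feature maps.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinear lookup on a (C, H, W) feature map at a fractional (x, y).
    Coordinates are assumed to lie inside the map for this sketch."""
    C, H, W = feat.shape
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1])
            + wy * ((1 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]))

def trajectory_attention(query, past_feats, traj_points, tau=1.0):
    """Reconstruct a current-time feature from delayed frames.

    query:       (C,) current-time query vector at one BEV cell.
    past_feats:  list of K delayed feature maps, each (C, H, W),
                 ordered oldest -> newest.
    traj_points: (K, 2) predicted (x, y) sampling point per past frame,
                 i.e. the object's back-traced position at that time.
    Returns the (C,) feature aggregated along the predicted trajectory.
    """
    # Temporally ordered samples along the predicted feature-level trajectory.
    samples = np.stack([bilinear_sample(f, x, y)
                        for f, (x, y) in zip(past_feats, traj_points)])  # (K, C)
    # Scaled dot-product scores between the query and each historical sample
    # (an illustrative stand-in for the learned attention).
    logits = samples @ query / (tau * np.sqrt(query.size))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()          # softmax over the K trajectory points
    return weights @ samples          # weighted reconstruction of the current feature
```

Because the sampling points follow a per-object motion path rather than a fixed grid, the attention compensates for the displacement an object undergoes during the communication delay, which is the spatial half of the misalignment the abstract describes; the weighted mixing of multi-frame samples addresses the semantic half.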