End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of multi-view collaborative perception in V2X scenarios, caused by occlusion, restricted fields of view, and communication latency, this paper proposes XET-V2X, an end-to-end collaborative perception framework. Methodologically, it (1) introduces a dual-layer spatial cross-attention mechanism, integrated with multi-scale deformable attention, for efficient alignment across heterogeneous views and modalities (images and point clouds); and (2) proposes a query-driven point cloud fusion paradigm guided by semantically consistent feature aggregation. The model aggregates multi-view image features and incorporates an end-to-end trainable 3D spatiotemporal Transformer. On the V2X-Seq-SPD and V2X-Sim (V2V/V2I) benchmarks, XET-V2X achieves significant improvements in 3D detection and tracking accuracy while remaining stable under communication latency, and visualizations further demonstrate its robustness in complex traffic scenes.
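The alignment mechanism rests on multi-scale deformable attention, where each spatial query samples a handful of learned offset locations on a feature map instead of attending densely. Below is a minimal single-scale, single-head PyTorch sketch of that sampling step; the class and parameter names (DeformableCrossAttention, offset_head, weight_head) are illustrative assumptions, not the paper's code, which is reportedly multi-scale and dual-layered.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Single-scale sketch of deformable cross-attention: each query predicts
    K sampling offsets around its reference point on a feature map, gathers
    features there by bilinear interpolation, and mixes them with learned
    attention weights. Illustrative, not the paper's implementation."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # (dx, dy) per sample
        self.weight_head = nn.Linear(dim, num_points)      # weight per sample
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, Q, C)   e.g. BEV/spatial queries
        # ref_points: (B, Q, 2)   normalized (x, y) in [0, 1]
        # feat_map:   (B, C, H, W) one image or BEV feature level
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)         # (B, Q, K)
        # grid_sample expects sampling locations in [-1, 1].
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets.tanh()).clamp(0, 1) * 2 - 1
        values = self.value_proj(feat_map)                          # (B, C, H, W)
        sampled = F.grid_sample(values, locs, align_corners=False)  # (B, C, Q, K)
        fused = (sampled * weights.unsqueeze(1)).sum(dim=-1)        # (B, C, Q)
        return self.out_proj(fused.transpose(1, 2))                 # (B, Q, C)

# Toy usage: 100 queries attend into one 32x32 feature level.
attn = DeformableCrossAttention(dim=256, num_points=4)
out = attn(torch.randn(2, 100, 256), torch.rand(2, 100, 2),
           torch.randn(2, 256, 32, 32))   # -> (2, 100, 256)
```

In the full model this block would presumably be applied per feature level and per view, with the outputs combined before the query update.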

📝 Abstract
Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal-fusion end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, and point cloud fusion is then guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.
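The abstract fixes the fusion order: queries read the multi-view image features first, and the updated, image-informed queries then drive point cloud fusion. A minimal sketch of that wiring is below, with standard multi-head attention standing in for the paper's deformable attention; the module name and the flattened token inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoStageFusionLayer(nn.Module):
    """Sketch of the stated fusion order: queries attend to multi-view image
    tokens first, then the updated queries attend to point-cloud tokens.
    Plain multi-head attention stands in for deformable attention here."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pts_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, img_tokens, pts_tokens):
        # queries:    (B, Q, C)  shared spatiotemporal queries
        # img_tokens: (B, Ni, C) flattened image features from all views/agents
        # pts_tokens: (B, Np, C) flattened point-cloud (BEV) features
        q = self.norm1(queries + self.img_attn(queries, img_tokens, img_tokens)[0])
        q = self.norm2(q + self.pts_attn(q, pts_tokens, pts_tokens)[0])
        return q  # image-informed queries have now also absorbed point-cloud cues
```

Running the image step first means the point-cloud step operates on semantically enriched queries, which is presumably how the design reduces cross-modal interaction cost relative to fusing both modalities jointly.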
Problem

Research questions and friction points this paper is trying to address.

How to enhance 3D perception in autonomous driving via multimodal fusion.
How to overcome occlusion and limited fields of view in V2X cooperative scenarios.
How to keep detection and tracking accurate under communication delays.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end multimodal fusion for 3D tracking (see the sketch after this list)
Dual-layer spatial cross-attention aligning viewpoints
Shared spatiotemporal representation unifying V2X collaboration
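The summary credits XET-V2X with stability under communication latency but does not spell out the mechanism. One common approach in query-based trackers is to carry track queries across timestamps with a simple motion model; the sketch below is only an assumption along those lines, and the velocity input is a hypothetical per-query estimate (e.g., from a velocity regression head).

```python
import torch

def propagate_track_queries(feat, ref_bev, velocity, dt):
    """Hypothetical latency compensation: shift each track query's BEV
    reference point by its estimated velocity over the delay dt
    (constant-velocity assumption; not confirmed as the paper's mechanism).

    feat:     (Q, C) query features, carried over unchanged
    ref_bev:  (Q, 2) reference points in BEV metres
    velocity: (Q, 2) per-query (vx, vy) estimates
    dt:       delay in seconds between remote and ego timestamps
    """
    return feat, ref_bev + velocity * dt
```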