🤖 AI Summary
In V2X autonomous driving, single-vehicle perception suffers from occlusion, limited line of sight, and narrow fields of view, leading to incomplete and inaccurate 3D semantic occupancy prediction; no dedicated collaborative benchmark exists to study this problem. To address this gap, we augment the existing V2X-Sim collaborative perception dataset into the first synthetic benchmark for cooperative 3D semantic occupancy prediction in vehicle-infrastructure systems: replaying its collaborative perception sequences in CARLA with high-resolution semantic voxel sensors yields dense, complete voxel-level geometric and semantic ground truth. On this benchmark, we propose the first cooperative 3D semantic occupancy prediction framework, evaluated at multiple prediction ranges, with a baseline model that combines spatial alignment and cross-agent attention-based feature aggregation. Experiments demonstrate consistent and significant gains over single-agent methods across all prediction ranges, especially at long distances, validating the efficacy of collaborative perception for robust long-range occupancy prediction.
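The fusion step can be pictured with a short sketch: each neighbor's feature map is warped into the ego frame (spatial alignment), and every spatial cell then attends over the per-agent features at that location (cross-agent attention). This is a minimal PyTorch illustration under assumed BEV-style 2D features and known relative poses; the names `warp_to_ego` and `CrossAgentAttention` are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_to_ego(feat, ego_from_agent):
    """Spatially align a neighbor's BEV feature map into the ego frame.

    feat:           (B, C, H, W) features in the sender's coordinate frame
    ego_from_agent: (B, 2, 3) affine transform (rotation + translation),
                    expressed in normalized grid coordinates
    """
    grid = F.affine_grid(ego_from_agent, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)


class CrossAgentAttention(nn.Module):
    """Fuse the ego feature map with spatially aligned neighbor maps:
    each spatial cell attends over the per-agent features at that cell."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, ego_feat, neighbor_feats):
        # ego_feat: (B, C, H, W); neighbor_feats: list of already-warped maps
        B, C, H, W = ego_feat.shape
        agents = torch.stack([ego_feat] + neighbor_feats, dim=1)  # (B, N, C, H, W)
        N = agents.shape[1]
        # One attention "sequence" per spatial cell, of length N (the agents).
        tokens = agents.permute(0, 3, 4, 1, 2).reshape(B * H * W, N, C)
        query = tokens[:, :1]                        # ego token queries all agents
        fused, _ = self.attn(query, tokens, tokens)  # (B*H*W, 1, C)
        return fused.reshape(B, H, W, C).permute(0, 3, 1, 2)  # back to (B, C, H, W)
```

In practice the affine transform would be derived from the agents' relative poses, and a full 3D voxel variant would apply the same warp-then-attend pattern to 5D tensors.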
📝 Abstract
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information among agents, thereby enhancing the completeness and accuracy of the predicted occupancy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor, providing dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with gains that grow as the prediction range expands.
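To make the range-varying benchmarks concrete, here is a hedged sketch of how per-range mIoU might be computed: crop the voxel grid to a radius around the ego vehicle and score only the voxels inside it. The grid layout, voxel size, and range values below are assumptions for illustration, not the benchmark's actual settings.

```python
import numpy as np


def miou_within_range(pred, gt, voxel_size=0.5, max_range=25.6,
                      num_classes=10, ignore_index=255):
    """Mean IoU over voxels whose centers lie within max_range meters of the
    ego origin, assumed here to be the center of the (X, Y, Z) grid."""
    X, Y, Z = gt.shape
    # Ground-plane distance of each voxel center from the ego origin.
    xs = (np.arange(X) - X / 2 + 0.5) * voxel_size
    ys = (np.arange(Y) - Y / 2 + 0.5) * voxel_size
    dist = np.sqrt(xs[:, None] ** 2 + ys[None, :] ** 2)           # (X, Y)
    mask = (dist <= max_range)[:, :, None] & (gt != ignore_index)  # (X, Y, Z)

    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c)[mask].sum()
        union = np.logical_or(pred == c, gt == c)[mask].sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


# Score the same predictions at several ranges to probe long-range gains:
# scores = {r: miou_within_range(pred, gt, max_range=r) for r in (12.8, 25.6, 51.2)}
```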