🤖 AI Summary
Existing roadside cooperative perception methods emphasize model architecture design while neglecting data-level challenges such as calibration errors, sparse information, and multi-view inconsistency, which leads to poor performance on recently published datasets. To address this, we propose RoCo-Sim, the first simulation framework tailored to roadside cooperative perception. RoCo-Sim generates diverse, multi-view-consistent roadside data from a single image through dynamic foreground editing and full-scene style transfer. Its pipeline has four stages: camera extrinsic optimization, 3D asset placement with MOAS (Multi-View Occlusion-Aware Sampler), foreground consistency modeling with DepthSAM (which models foreground-background relationships from single-frame, fixed-view images), and stylized post-processing. On Rcooper-Intersection and TUMTraf-V2X, RoCo-Sim improves 3D detection AP₇₀ over state-of-the-art methods by 83.74 and 83.12, respectively, filling a critical gap in roadside perception simulation. Code and pre-trained models will be publicly released.
📝 Abstract
Roadside collaborative perception refers to a system in which multiple roadside units pool their perceptual data to help vehicles enhance their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues such as calibration errors, sparse information, and multi-view inconsistency, leading to poor performance on recently published datasets. To enhance roadside collaborative perception and address these data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim generates diverse, multi-view-consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D-to-2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM models foreground-background relationships from single-frame, fixed-view images, ensuring multi-view consistency of the foreground; and (4) a Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim
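To illustrate why accurate camera extrinsics matter for 3D-to-2D projection, the following is a minimal sketch of a standard pinhole projection and a mean reprojection error. It is not RoCo-Sim's implementation: the function names `project_point` and `reprojection_error`, the parameter layout, and the numeric values are all assumptions chosen for illustration, and lens distortion is ignored.

```python
import math
from typing import Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project_point(point_world: Vec3, rotation: Mat3, translation: Vec3,
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a world-frame 3D point to pixel coordinates (pinhole model).

    rotation (3x3) and translation (3,) are the hypothetical world-to-camera
    extrinsics; fx, fy, cx, cy are the camera intrinsics in pixels.
    """
    # Camera-frame coordinates: X_c = R * X_w + t
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic mapping.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def reprojection_error(points_world: Sequence[Vec3],
                       pixels_observed: Sequence[Tuple[float, float]],
                       rotation: Mat3, translation: Vec3,
                       fx: float, fy: float, cx: float, cy: float) -> float:
    """Mean pixel distance between projected points and observed pixels."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_world, pixels_observed):
        u, v = project_point(point, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_world)


# Illustrative values only: identity rotation, camera at the world origin.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 2.0, 10.0), IDENTITY, (0.0, 0.0, 0.0),
                      fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```

In this sketch, a 0.1 m error in the x-translation shifts a point at 10 m depth by about 10 pixels. An extrinsic optimization step would adjust the rotation and translation to reduce an error of this kind.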