RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing roadside cooperative perception methods focus on model architecture design and neglect data-level challenges such as calibration errors, sparse information, and multi-view inconsistency, which leads to suboptimal real-world performance. To address this, we propose RoCo-Sim, the first simulation framework tailored to roadside cooperative perception. It generates diverse, multi-view consistent roadside data through dynamic foreground editing and full-scene style transfer of a single image. The pipeline has four stages: camera extrinsic optimization, 3D asset placement with MOAS (a multi-view occlusion-aware sampler), foreground consistency modeling with DepthSAM (which models foreground-background relationships from single-frame, fixed-view images), and stylized post-processing. On Rcooper-Intersection and TUMTraf-V2X, RoCo-Sim outperforms state-of-the-art methods by 83.74 and 83.12 in 3D detection AP₇₀, respectively. Code and pre-trained models will be publicly released.

📝 Abstract
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues such as calibration errors, sparse information, and multi-view inconsistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address these data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim generates diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D-to-2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of the foreground; and (4) a Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim
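
The abstract does not spell out how MOAS chooses placements, so the following is only a minimal sketch of what an occlusion-aware placement sampler could look like, not the authors' algorithm. It assumes a 2D bird's-eye view, treats every object as a disc of fixed radius, and accepts a candidate position only if it overlaps nothing and has a clear line of sight from a minimum number of cameras. The function names, the `radius` and `min_views` parameters, and the rejection-sampling strategy are all hypothetical.

```python
import math
import random

def segment_point_distance(a, b, p):
    """Shortest distance from 2D point p to the segment a-b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the line so it stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def is_visible(camera, candidate, blockers, radius):
    """Assumed visibility rule: no blocker disc touches the camera-to-candidate segment."""
    return all(segment_point_distance(camera, candidate, b) > radius for b in blockers)

def sample_placements(cameras, obstacles, bounds, n,
                      radius=1.5, min_views=2, max_tries=10000, seed=0):
    """Rejection-sample up to n ground positions that are collision-free and
    visible from at least min_views cameras (hypothetical acceptance rule)."""
    rng = random.Random(seed)
    (xmin, xmax), (ymin, ymax) = bounds
    placed = []
    for _ in range(max_tries):
        if len(placed) == n:
            break
        cand = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        blockers = list(obstacles) + placed
        if any(math.dist(cand, b) < 2 * radius for b in blockers):
            continue  # candidate disc would overlap an existing object
        views = sum(is_visible(cam, cand, blockers, radius) for cam in cameras)
        if views >= min_views:
            # Simplification: earlier placements are not re-checked for new occlusion.
            placed.append(cand)
    return placed

placements = sample_placements(
    cameras=[(0.0, 0.0), (40.0, 0.0)],
    obstacles=[(20.0, 10.0)],
    bounds=((5.0, 35.0), (5.0, 35.0)),
    n=5,
)
print(placements)
```

A real sampler would work with 3D asset geometry and camera frusta; the disc-and-segment test here only illustrates the idea of rejecting placements that are hidden from too many views.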
Problem

Research questions and friction points this paper is trying to address.

Addresses calibration errors and sparse data in roadside perception.
Improves multi-view consistency in collaborative perception systems.
Enhances 3D object detection accuracy using simulated roadside data.
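
To illustrate why calibration errors matter, the sketch below projects a 3D point through a standard pinhole camera model twice: once with the correct extrinsic rotation and once with a 1-degree rotation error about the camera's vertical axis. The intrinsics, the point, and the error magnitude are made-up values chosen for illustration, and the code is not taken from RoCo-Sim.

```python
import math

def rotation_y(angle):
    """3x3 rotation about the camera's y (vertical) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(point_world, R, t, fx, fy, cx, cy):
    """Pinhole projection: x_cam = R @ x_world + t, then perspective divide."""
    x, y, z = (sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
               for i in range(3))
    if z <= 0:
        return None  # point is behind the camera
    return (fx * x / z + cx, fy * y / z + cy)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)  # assumed values
point = (0.0, 0.0, 50.0)  # 50 m straight ahead of the camera

true_px = project(point, IDENTITY, (0.0, 0.0, 0.0), **intrinsics)
bad_px = project(point, rotation_y(math.radians(1.0)), (0.0, 0.0, 0.0), **intrinsics)
pixel_error = math.hypot(bad_px[0] - true_px[0], bad_px[1] - true_px[1])
print(f"true pixel {true_px}, miscalibrated pixel {bad_px}, error {pixel_error:.2f} px")
```

With these assumed intrinsics the 1-degree error moves the projection by roughly 17 pixels, which is enough to misalign a projected 3D box with the object in the image.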
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic foreground editing for multi-view consistency
Camera Extrinsic Optimization for accurate 3D projection
DepthSAM for foreground-background relationship modeling
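
The summary says DepthSAM models foreground-background relationships but does not describe the mechanism. As a rough illustration of why such a relationship is needed when inserting simulated objects, the sketch below composites an asset into an image with a plain per-pixel depth test, so that closer background structures (for example a pole) still occlude the inserted object. It assumes per-pixel depth is already available for both layers; the `composite` function and the toy data are hypothetical and not the paper's method.

```python
def composite(background, bg_depth, asset, asset_depth):
    """Draw an asset pixel only where it is closer to the camera than the background.

    All arguments are row-major 2D lists of the same shape.
    A value of None in `asset` marks a transparent pixel.
    """
    out = [row[:] for row in background]  # copy so the input image is untouched
    for r, row in enumerate(asset):
        for c, pixel in enumerate(row):
            if pixel is not None and asset_depth[r][c] < bg_depth[r][c]:
                out[r][c] = pixel
    return out

# Toy 1x4 "image": a pole 10 m away stands in front of a road 30 m away.
background = [["road", "road", "pole", "road"]]
bg_depth = [[30.0, 30.0, 10.0, 30.0]]

# A simulated car at 20 m covering the first three pixels.
asset = [["car", "car", "car", None]]
asset_depth = [[20.0, 20.0, 20.0, 20.0]]

result = composite(background, bg_depth, asset, asset_depth)
print(result)
```

In the toy example the car replaces the road pixels but not the pole, because the pole is nearer to the camera than the inserted car.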
Authors

Yuwen Du
Shanghai Jiao Tong University
Multi-Agent, Autonomous Driving Simulation

Anning Hu
Professor of Sociology, Fudan University, Shanghai, China
Inequality, Culture, Methodology

Zichen Chao
Nanjing University of Science and Technology

Yifan Lu
Shanghai Jiao Tong University

Junhao Ge
Shanghai Jiao Tong University
Autonomous Driving

Genjia Liu
Shanghai Jiao Tong University

Weitao Wu
Nanjing University of Science and Technology

Lanjun Wang
Tianjin University

Siheng Chen
Shanghai Jiao Tong University
Collective intelligence, LLM agent, graph signal processing, collaborative perception