AI Summary
Rare vehicle categories in autonomous driving exhibit a long-tailed distribution, severely limiting the generalization capability of perception models. To address this, we propose MultiEditor, a novel framework that, for the first time, leverages 3D Gaussian Splatting (3DGS) as a cross-modal structural and appearance prior to jointly edit images and LiDAR point clouds. Our method introduces a depth-guided deformable cross-modal conditioning module, a multi-level appearance control mechanism, and a dual-branch latent diffusion architecture, integrated with semantic guidance, pixel-level pasting, and multi-stage refinement. Experiments demonstrate that MultiEditor significantly outperforms existing methods in both visual-geometric fidelity and editing controllability. Critically, the synthesized rare-vehicle data substantially improves downstream detector performance on long-tail categories, yielding a +12.7% mAP gain.
Abstract
Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the introduction of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism, comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement, to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that uses 3DGS-rendered depth to adaptively enable mutual guidance between modalities, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.