MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

📅 2025-07-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Rare vehicle categories in autonomous driving follow a long-tailed distribution, which severely limits the generalization capability of perception models. To address this, we propose MultiEditor, a novel framework that, for the first time, leverages 3D Gaussian Splatting (3DGS) as a cross-modal structural and appearance prior to jointly edit images and LiDAR point clouds. Our method combines a dual-branch latent diffusion architecture, a depth-guided deformable cross-modal conditioning module, and a multi-level appearance control mechanism consisting of pixel-level pasting, semantic-level guidance, and multi-branch refinement. Experiments demonstrate that MultiEditor significantly outperforms existing methods in both visual-geometric fidelity and editing controllability. Critically, the synthesized rare-vehicle data substantially improves downstream detector performance on long-tail categories, yielding a +12.7% mAP gain.

๐Ÿ“ Abstract
Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the use of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism, comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement, to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.
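To make the idea of pixel-level pasting concrete, here is a minimal NumPy sketch of compositing a rendered object into a scene image with a per-pixel depth test. It is an illustration under stated assumptions, not the paper's implementation: the function name `paste_rendered_object`, its arguments, and the use of a simple alpha blend are all hypothetical, and the rendered color, opacity, and depth maps are assumed to come from some 3DGS renderer that is not shown.

```python
import numpy as np

def paste_rendered_object(scene_rgb, scene_depth, obj_rgb, obj_alpha, obj_depth):
    """Composite a rendered object into a scene image with a depth test.

    Hypothetical helper for illustration only.
    scene_rgb, obj_rgb: (H, W, 3) float arrays in [0, 1].
    scene_depth, obj_alpha, obj_depth: (H, W) float arrays.
    """
    # The object is drawn only where it was rendered (alpha > 0)
    # and where it is closer to the camera than the existing scene.
    visible = (obj_alpha > 0) & (obj_depth < scene_depth)

    # Blend weight: the object's opacity where visible, zero elsewhere.
    weight = np.where(visible, obj_alpha, 0.0)[..., None]

    out_rgb = weight * obj_rgb + (1.0 - weight) * scene_rgb
    out_depth = np.where(visible, obj_depth, scene_depth)
    return out_rgb, out_depth
```

The depth test keeps existing foreground content (for example, a pole in front of the inserted vehicle) from being overwritten, which is one simple way a rendered depth map can inform image editing.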
Problem

Research questions and friction points this paper is trying to address.

Editing images and LiDAR data jointly for autonomous driving
Addressing rare vehicle categories in multimodal perception
Enhancing cross-modality consistency with 3D Gaussian Splatting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch latent diffusion framework for joint editing
3D Gaussian Splatting prior for structure and appearance
Depth-guided deformable cross-modality condition module
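Any mechanism that lets images and LiDAR guide each other needs some correspondence between 3D points and pixels. The sketch below shows one standard way to obtain it, a pinhole projection of points already expressed in the camera frame. It is not taken from the paper: the function name `project_points`, the intrinsic matrix `K`, and the assumption that LiDAR points have already been transformed into camera coordinates are all illustrative.

```python
import numpy as np

def project_points(points_cam, K, width, height):
    """Project 3D points (camera frame) to pixel coordinates.

    Hypothetical helper for illustration only.
    points_cam: (N, 3) array of x, y, z in the camera frame (z forward).
    K: (3, 3) pinhole intrinsic matrix.
    Returns pixel coordinates (N, 2), depths (N,), and a validity mask (N,).
    """
    z = points_cam[:, 2]
    in_front = z > 1e-6

    # Avoid dividing by zero or negative depth; such points are masked out anyway.
    safe_z = np.where(in_front, z, 1.0)

    # Homogeneous projection followed by the perspective divide.
    uvw = points_cam @ K.T
    uv = uvw[:, :2] / safe_z[:, None]

    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, z, in_front & inside
```

With the mask applied, each valid point's depth can be compared with a rendered depth map at the same pixel, which is the kind of signal a depth-guided module could use to decide where to sample features across modalities.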