🤖 AI Summary
Acquiring real-world multimodal (RGB + lidar) data for safety-critical applications such as autonomous driving is costly and complex, and synthetic alternatives typically lack realism and precise spatial control. To address this, the paper introduces the first diffusion-based framework for joint camera-lidar object insertion, conditioned on a 3D bounding box and guided by a single RGB reference image. The method features: (1) 3D bounding-box spatial conditioning, which replaces conventional edit masks and removes geometric ambiguity; (2) cross-modal feature alignment and consistency constraints that preserve geometric fidelity and semantic coherence; and (3) a joint generation architecture that produces scale-adaptive, synchronized outputs across both sensor modalities. Evaluated on real automotive datasets, the approach improves SSIM and cross-modal consistency metrics by over 27% versus baselines, substantially broadening robustness-testing coverage for perception models.
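To make the bounding-box conditioning concrete, the sketch below projects a 3D box (center, size, yaw) into the camera image to obtain a geometry-aware edit region whose position and scale follow the 3D specification. This is a minimal illustration under assumed coordinate conventions and camera intrinsics, not MObI's actual conditioning pipeline; all names and numbers are hypothetical.

```python
# Minimal sketch (illustrative assumption, not the authors' implementation):
# derive a camera-space edit region from a 3D bounding box, the kind of
# geometry-aware signal that replaces a hand-drawn inpainting mask.
import numpy as np

def box_corners_cam(center, size, yaw):
    """8 corners of a 3D box in the camera frame (x right, y down, z forward);
    yaw rotates the box about the vertical (y) axis."""
    l, w, h = size
    x = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * (w / 2)
    y = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * (h / 2)
    z = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * (l / 2)
    corners = np.stack([x, y, z])                          # (3, 8) box-frame offsets
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return rot @ corners + np.asarray(center).reshape(3, 1)

def project_to_image(corners_cam, intrinsics):
    """Pinhole projection of camera-frame 3D points to pixel coordinates."""
    uvw = intrinsics @ corners_cam
    return uvw[:2] / uvw[2:3]                              # (2, 8)

# Example: a car-sized box 15 m in front of the camera (assumed intrinsics).
K = np.array([[1266.0,    0.0, 800.0],
              [   0.0, 1266.0, 450.0],
              [   0.0,    0.0,   1.0]])
corners = box_corners_cam(center=(0.5, 1.2, 15.0), size=(4.5, 1.9, 1.6), yaw=0.1)
uv = project_to_image(corners, K)
# Tight pixel-space rectangle around the projected box: the 2D edit region
# inherits accurate position and scale directly from the 3D specification.
u_min, v_min = uv.min(axis=1)
u_max, v_max = uv.max(axis=1)
print(f"edit region: u in [{u_min:.0f}, {u_max:.0f}], v in [{v_min:.0f}, {v_max:.0f}]")
```

Moving the box in 3D moves and rescales the edit region consistently, which is the controllability a 2D mask alone cannot provide.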
📝 Abstract
Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.
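To illustrate the co-located multimodal aspect, the sketch below uses the same kind of 3D box to select the lidar points belonging to the insertion region, so the camera and lidar edits share a single 3D specification. Again, this is an assumed, simplified convention (x forward, y left, z up) for illustration only, not MObI's implementation.

```python
# Minimal sketch (assumption, not MObI's code): the 3D box that defines the
# camera edit region also selects the lidar points to replace, keeping the
# two modalities tied to one co-located 3D specification.
import numpy as np

def points_in_box(points, center, size, yaw):
    """Boolean mask over an (N, 3) lidar cloud (x forward, y left, z up)
    marking points inside a yaw-rotated 3D box."""
    l, w, h = size
    # Transform points into the box frame: translate, then undo the yaw.
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - np.asarray(center)) @ rot.T
    return (np.abs(local[:, 0]) <= l / 2) & \
           (np.abs(local[:, 1]) <= w / 2) & \
           (np.abs(local[:, 2]) <= h / 2)

# Example with a synthetic scan: the mask is the lidar-space edit region.
scan = np.random.uniform([-30, -30, -2], [30, 30, 4], size=(20000, 3))
mask = points_in_box(scan, center=(12.0, 2.0, 0.8), size=(4.5, 1.9, 1.6), yaw=0.3)
print(f"{mask.sum()} of {len(scan)} points fall inside the insertion box")
```

Conditioning both modalities on the same box is what allows an inserted object to appear at a consistent location, scale, and orientation in camera and lidar simultaneously.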