MObI: Multimodal Object Inpainting Using Diffusion Models

📅 2025-01-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenges of acquiring real-world multimodal (RGB + LiDAR) data for safety-critical applications such as autonomous driving—and the limited realism and spatial controllability of synthetic alternatives—this paper introduces the first diffusion-based framework conditioned on 3D bounding boxes for camera-LiDAR co-located object insertion guided by a single RGB reference image. Our method features: (1) 3D bounding box-driven spatial conditioning, replacing conventional masks to eliminate geometric ambiguity; (2) cross-modal feature alignment and consistency constraints to ensure geometric fidelity and semantic coherence; and (3) a joint sensor-generation architecture enabling scale-adaptive and synchronized cross-modal control. Evaluated on real automotive datasets, our approach improves SSIM and cross-modal consistency metrics by over 27% versus baselines, significantly enhancing robustness testing coverage for perception models.

Technology Category

Application Category

📝 Abstract
Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.
Problem

Research questions and friction points this paper is trying to address.

Autonomous Driving
Visual and Perception Data
Computer-generated Data Realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Models
Multi-sensor Integration
Object Insertion
🔎 Similar Papers
No similar papers found.
Alexandru Buburuzan
Alexandru Buburuzan
DPhil student, University of Oxford
Computer VisionMultimodal Learning
A
Anuj Sharma
FiveAI
J
John Redford
FiveAI
P
P. Dokania
FiveAI, University of Oxford
Romain Mueller
Romain Mueller
Anthropic