🤖 AI Summary
Visual distractors in real-world scenes severely compromise the robustness and safety of robotic manipulation. To address this, we propose NICE, the first framework to combine generative image inpainting with large language models (LLMs) to strengthen visuomotor policies for robotics, without requiring additional robot data collection, simulator access, or custom model training. NICE performs object replacement, restyling, and distractor removal while preserving spatial relationships and action-label consistency. By synthesizing diverse visual experiences from existing demonstration data, it effectively mitigates distribution shift. NICE-augmented data is used to fine-tune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for end-to-end manipulation. Experiments in cluttered environments demonstrate a 20.3% improvement in affordance prediction accuracy, an average 11.2% increase in task success rate, a 6.1% reduction in target confusion, and a 7.4% decrease in collision rate.
📝 Abstract
Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method narrows the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity, constructing new experiences from existing demonstrations. Using generative image models and large language models, NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. These edits preserve spatial relationships, avoid occluding target objects, and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets.
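To make the editing pipeline concrete, below is a minimal, hypothetical sketch of one NICE-style edit applied to a demonstration frame, assuming an off-the-shelf diffusion inpainting model from the diffusers library. The paper's actual models, prompts, and masking procedure are not specified in this abstract; the helper names and example prompts here are illustrative assumptions.

```python
# Hypothetical sketch of a NICE-style augmentation step. The abstract only
# states that "image generative frameworks and large language models" are
# used, so the specific model and prompts below are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load an off-the-shelf inpainting model (assumption: any diffusion-based
# inpainter would serve the same role).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def edit_frame(frame: Image.Image, mask: Image.Image, prompt: str) -> Image.Image:
    """Apply one NICE-style edit to a demonstration frame.

    frame  -- an RGB observation from an existing demonstration
    mask   -- binary mask covering a non-target (distractor) object, so the
              target object and its action labels stay untouched
    prompt -- edit instruction, e.g. produced by an LLM:
              replacement: "a red ceramic mug on a wooden table"
              restyling:   "the same bowl, but made of brushed steel"
              removal:     "empty wooden table surface, nothing on it"
    """
    return pipe(prompt=prompt, image=frame, mask_image=mask).images[0]
```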
Using real-world scenes, we showcase our framework's ability to produce photo-realistic scene enhancements. For downstream tasks, we use NICE data to fine-tune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully narrows OOD gaps, yielding over 20% higher accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, the success rate increases by an average of 11% when testing in environments populated with varying numbers of distractors. Furthermore, our method improves visual robustness, lowering target confusion by 6%, and enhances safety, reducing the collision rate by 7%.