🤖 AI Summary
Addressing the scarcity and high cost of demonstration data for large-scale robotic manipulation, as well as the persistent sim-to-real gap, this paper proposes RoboPearls, an editable video simulation framework that takes multi-view photorealistic videos as input and enables object-level interaction and controllable simulation. It introduces Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss (3D-NNFM) to jointly enhance geometric and semantic consistency. Furthermore, it integrates large language models (LLMs) for natural-language-driven automated scene generation and vision-language models (VLMs) to identify learning bottlenecks for closed-loop optimization. The framework thus unifies 3D Gaussian Splatting (3DGS), ISD, 3D-NNFM, LLMs, and VLMs. Extensive evaluation on RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and real-world robotic platforms demonstrates significant improvements in simulation fidelity and cross-domain policy transfer performance.
📝 Abstract
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues and close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating the strong simulation performance of our framework.
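The abstract does not spell out the 3D-NNFM loss; NNFM-style losses (nearest-neighbor feature matching, as used in radiance-field stylization work) match each rendered feature to its closest reference feature rather than enforcing a fixed pixel-wise correspondence. A minimal sketch of the base matching idea, assuming cosine-distance matching over per-point feature vectors (the paper's 3D-regularized formulation adds further geometric terms not shown here; `nnfm_loss` and its arguments are illustrative names, not the authors' API):

```python
import numpy as np

def nnfm_loss(rendered_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Nearest-neighbor feature matching loss (sketch).

    For each rendered feature vector, find the closest reference
    feature under cosine distance, then average those distances.
    rendered_feats: (N_r, D) array; ref_feats: (N_s, D) array.
    """
    # Normalize rows to unit length so a dot product gives cosine similarity.
    r = rendered_feats / np.linalg.norm(rendered_feats, axis=1, keepdims=True)
    s = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    # Pairwise cosine distances, shape (N_r, N_s).
    dist = 1.0 - r @ s.T
    # Each rendered feature is charged only for its nearest reference match.
    return float(dist.min(axis=1).mean())
```

Because each rendered feature is free to pick its nearest match, the loss tolerates spatial rearrangement between rendered and reference views, which is why this family of losses is popular for appearance transfer on radiance fields.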