🤖 AI Summary
Addressing the scarcity and high cost of demonstration data for large-scale robotic manipulation, as well as the persistent sim-to-real gap, this paper proposes RoboPearls, an editable video simulation framework that takes multi-view photorealistic videos as input and enables object-level interaction and controllable simulation. It introduces Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss (3D-NNFM) to jointly enhance geometric and semantic consistency. Furthermore, it integrates large language models (LLMs) for natural-language-driven automated scene generation and vision-language models (VLMs) to identify learning bottlenecks for closed-loop optimization. The framework thus unifies 3D Gaussian Splatting (3DGS), ISD, 3D-NNFM, LLMs, and VLMs. Extensive evaluation on RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and real-world robotic platforms demonstrates significant improvements in simulation fidelity and cross-domain policy transfer performance.
📝 Abstract
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues and close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating the strong simulation performance of our framework.
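The abstract does not spell out the 3D-NNFM loss; NNFM-style losses (nearest-neighbor feature matching, as used in radiance-field stylization work) match each rendered feature to its closest reference feature rather than enforcing a fixed pixel-wise correspondence. A minimal sketch of the base matching idea, assuming cosine-distance matching over per-point feature vectors (the paper's 3D-regularized formulation adds further geometric terms not shown here; `nnfm_loss` and its arguments are illustrative names, not the authors' API):

```python
import numpy as np

def nnfm_loss(rendered_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Nearest-neighbor feature matching loss (sketch).

    For each rendered feature vector, find the closest reference
    feature under cosine distance, then average those distances.
    rendered_feats: (N_r, D) array; ref_feats: (N_s, D) array.
    """
    # Normalize rows to unit length so a dot product gives cosine similarity.
    r = rendered_feats / np.linalg.norm(rendered_feats, axis=1, keepdims=True)
    s = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    # Pairwise cosine distances, shape (N_r, N_s).
    dist = 1.0 - r @ s.T
    # Each rendered feature is charged only for its nearest reference match.
    return float(dist.min(axis=1).mean())
```

Because each rendered feature is free to pick its nearest match, the loss tolerates spatial rearrangement between rendered and reference views, which is why this family of losses is popular for appearance transfer on radiance fields.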