🤖 AI Summary
Bridging the domain gap between real-world RGB-D images and robot simulation environments remains challenging for digital twin task generation. Method: This paper proposes a simulation-task alignment framework leveraging vision-language models (VLMs) and an iterative routing mechanism to generate executable simulation tasks end-to-end from single-frame RGB-D input. The method integrates SAM2 for precise object segmentation, VLM-driven semantic understanding, dynamic matching against a simulation asset library, and automated generation of self-validating test suites—forming a closed-loop “perceive–match–generate–verify” optimization pipeline. Contribution/Results: It achieves the first high-fidelity geometric-semantic alignment between real-scene objects and simulation assets while ensuring physical feasibility and executability within physics engines. Evaluated on multiple real-world benchmarks, the approach significantly improves object correspondence accuracy (+23.6%), task success rate (+31.4%), and cross-scene generalization.
📝 Abstract
We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.