Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In open-world robotic manipulation, existing vision-language models (VLMs) exhibit strong high-level semantic generalization but lack physical grounding, which hinders their ability to generate executable low-level control commands.

Method: We propose a closed-loop control framework that tightly couples VLMs with a dynamic, interactive digital twin environment. The digital twin provides physically accurate, intervenable simulation. Multi-view, occlusion-free neural rendering enhances the VLM's understanding of complex physical interactions. The VLM's semantic plans serve as future-observation prompts to guide physics-based model predictive control (MPC).

Contribution/Results: This work introduces the first tightly integrated VLM–digital twin architecture for robotic control. Evaluated across diverse manipulation tasks, our approach significantly outperforms purely VLM-driven baselines, achieving high success rates, physically feasible trajectories, and precise alignment between natural language semantics and end-to-end control outputs.

📝 Abstract

Recent advancements in open-world robot manipulation have been largely driven by vision-language models (VLMs). While these models exhibit strong generalization ability in high-level planning, they struggle to predict low-level robot controls due to limited physical-world understanding. To address this issue, we propose a model predictive control framework for open-world manipulation that combines the semantic reasoning capabilities of VLMs with physically grounded, interactive digital twins of real-world environments. By constructing and simulating the digital twins, our approach generates feasible motion trajectories, simulates the corresponding outcomes, and prompts the VLM with future observations to evaluate and select the most suitable outcome based on the language instructions of the task. To further enhance the capability of pre-trained VLMs in understanding complex scenes for robotic control, we leverage the flexible rendering capabilities of the digital twin to synthesize the scene from various novel, unoccluded viewpoints. We validate our approach on a diverse set of complex manipulation tasks, demonstrating superior performance compared to baseline methods for language-conditioned robotic control using VLMs.
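The loop described in the abstract (sample candidate trajectories, simulate their outcomes, let an evaluator pick the best predicted future, execute, replan) can be illustrated with a minimal sampling-based MPC sketch. Everything here is an assumption made for illustration: the 2D point-mass `simulate` function stands in for the digital twin, and `make_goal_scorer` is a hypothetical distance-based stand-in for the VLM that would judge rendered future observations against the language instruction. It is not the paper's implementation.

```python
import math
import random
from typing import Callable, List, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def simulate(state: State, actions: List[Action]) -> State:
    """Toy stand-in for the digital twin: a point that integrates displacements."""
    x, y = state
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)


def sample_candidates(rng: random.Random, num_candidates: int,
                      horizon: int, max_step: float) -> List[List[Action]]:
    """Draw random action sequences (candidate motion trajectories)."""
    return [
        [(rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
         for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def mpc_step(state: State, score_outcome: Callable[[State], float],
             rng: random.Random, num_candidates: int = 64,
             horizon: int = 3, max_step: float = 0.1) -> Action:
    """One MPC iteration: sample, simulate, score predicted futures, pick the best."""
    candidates = sample_candidates(rng, num_candidates, horizon, max_step)
    outcomes = [simulate(state, plan) for plan in candidates]
    best = max(range(num_candidates), key=lambda i: score_outcome(outcomes[i]))
    # Receding horizon: only the first action of the best plan is executed.
    return candidates[best][0]


def make_goal_scorer(goal: State) -> Callable[[State], float]:
    """Hypothetical evaluator standing in for the VLM (higher score is better)."""
    def score(outcome: State) -> float:
        return -math.dist(outcome, goal)
    return score


def run_closed_loop(start: State, goal: State,
                    steps: int = 40, seed: int = 0) -> State:
    """Alternate planning and execution, replanning from each new state."""
    rng = random.Random(seed)
    scorer = make_goal_scorer(goal)
    state = start
    for _ in range(steps):
        action = mpc_step(state, scorer, rng)
        state = simulate(state, [action])
    return state
```

In a real system the scorer would be replaced by a call that renders each simulated outcome and asks the VLM to compare it with the task instruction. The receding-horizon structure (execute the first action, then replan) is what makes the loop closed.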

Problem

Research questions and friction points this paper is trying to address.

Combining VLMs with digital twins for robot control

Generating feasible motion trajectories via simulation

Enhancing VLM scene understanding with novel viewpoints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines VLMs with interactive digital twins

Generates motion trajectories via simulation

Synthesizes scenes from novel viewpoints
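One way to picture the selection of novel, unoccluded viewpoints is the geometric sketch below. The source does not say how viewpoints are chosen, so the whole procedure is an assumption: candidate cameras are placed on a hemisphere around a target point, and a candidate is discarded when its line of sight to the target passes through a spherical occluder (for example, a coarse proxy for the robot arm). All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def hemisphere_viewpoints(center: Vec3, radius: float, n_azimuth: int = 8,
                          elevations_deg: Sequence[float] = (30.0, 60.0)) -> List[Vec3]:
    """Place candidate cameras on rings above the target point."""
    views = []
    for elev in elevations_deg:
        e = math.radians(elev)
        for k in range(n_azimuth):
            a = 2.0 * math.pi * k / n_azimuth
            views.append((
                center[0] + radius * math.cos(e) * math.cos(a),
                center[1] + radius * math.cos(e) * math.sin(a),
                center[2] + radius * math.sin(e),
            ))
    return views


def segment_hits_sphere(p: Vec3, q: Vec3, sphere_center: Vec3,
                        sphere_radius: float) -> bool:
    """True if the segment from p to q passes through the sphere."""
    d = [q[i] - p[i] for i in range(3)]
    f = [sphere_center[i] - p[i] for i in range(3)]
    dd = sum(x * x for x in d)
    # Parameter of the point on the segment closest to the sphere center.
    t = 0.0 if dd == 0.0 else max(0.0, min(1.0, sum(f[i] * d[i] for i in range(3)) / dd))
    closest = [p[i] + t * d[i] for i in range(3)]
    return math.dist(closest, sphere_center) < sphere_radius


def unoccluded_viewpoints(target: Vec3, occluders: List[Tuple[Vec3, float]],
                          radius: float = 1.0) -> List[Vec3]:
    """Keep only cameras whose line of sight to the target is clear."""
    return [
        v for v in hemisphere_viewpoints(target, radius)
        if not any(segment_hits_sphere(v, target, c, r) for c, r in occluders)
    ]
```

The surviving camera poses would then be handed to the digital twin's renderer to produce the images shown to the VLM. Sphere occluders are used here only because the intersection test is short; a real scene would need a proper visibility query against the reconstructed geometry.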
