AI Summary
To address weak physical grounding, susceptibility to hallucination, and poor long-horizon physical consistency in robotic manipulation planning, this paper proposes the Embodied Tree-of-Thought (EToT) framework. EToT uses a physics-based digital twin as its reasoning substrate and combines domain priors with a reflective branching mechanism: it runs a tree search inside the simulated environment to predict action outcomes and iteratively refine manipulation trajectories, while a vision-language model (VLM) diagnoses failure cases and generates corrective strategies. The work couples an embodied world model with Tree-of-Thought search and establishes a closed-loop Real2Sim2Real transfer. Experiments show that EToT outperforms existing baselines across diverse short- and long-horizon manipulation tasks, improving physical dynamics prediction, failure recovery, and overall task success rate.
Abstract
World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to satisfy physical constraints consistently over long horizons. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Project website: https://embodied-tree-of-thoughts.github.io
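The two branching mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `tree_search`, `priori_branch`, `simulate`, and `reflect` are hypothetical, and so is the toy mug-and-drawer task. The physics-based digital twin is replaced by a few hand-written rules and the VLM diagnosis by a lookup table. The breadth-first expansion order is also an assumption, since the abstract does not state one.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = Tuple[str, ...]  # a plan is an ordered tuple of action names


@dataclass
class Node:
    plan: Plan
    depth: int = 0
    children: List["Node"] = field(default_factory=list)  # kept for inspection


def tree_search(
    priori_branch: Callable[[], List[Plan]],
    simulate: Callable[[Plan], Tuple[bool, str]],
    reflect: Callable[[Plan, str], List[Plan]],
    max_depth: int = 3,
) -> Optional[Plan]:
    """Breadth-first search over plans (hypothetical sketch, not the paper's code).

    Roots come from priori_branch(); each failed node is expanded with the
    corrective plans that reflect() proposes for the reported failure.
    """
    queue = deque(Node(p) for p in priori_branch())
    seen = set()
    while queue:
        node = queue.popleft()
        if node.plan in seen:
            continue  # do not re-simulate a plan that was already tried
        seen.add(node.plan)
        ok, failure = simulate(node.plan)
        if ok:
            return node.plan
        if node.depth < max_depth:
            for fixed in reflect(node.plan, failure):
                child = Node(fixed, node.depth + 1)
                node.children.append(child)
                queue.append(child)
    return None


# --- Toy stand-ins (assumptions): put a mug into a closed drawer ---
def toy_priori_branch() -> List[Plan]:
    return [("place_mug",), ("grasp_mug", "place_mug")]


def toy_simulate(plan: Plan) -> Tuple[bool, str]:
    """Rule-based stand-in for a physics simulator rollout."""
    holding, drawer_open = False, False
    for action in plan:
        if action == "grasp_mug":
            holding = True
        elif action == "open_drawer":
            if holding:
                return False, "hand_occupied"
            drawer_open = True
        elif action == "place_mug":
            if not holding:
                return False, "mug_not_grasped"
            if not drawer_open:
                return False, "drawer_closed"
            return True, ""
    return False, "goal_not_reached"


def toy_reflect(plan: Plan, failure: str) -> List[Plan]:
    """Lookup-table stand-in for VLM diagnosis: insert a fix at every position."""
    fix = {"mug_not_grasped": "grasp_mug", "drawer_closed": "open_drawer"}.get(failure)
    if fix is None:
        return []
    return [plan[:i] + (fix,) + plan[i:] for i in range(len(plan) + 1)]


result = tree_search(toy_priori_branch, toy_simulate, toy_reflect)
print(result)  # ('open_drawer', 'grasp_mug', 'place_mug')
```

In this toy run both initial candidates fail in simulation. The search recovers by inserting `open_drawer` ahead of the grasp, because opening the drawer with the mug in hand is rejected by the simulated rules. The point of the sketch is the division of labor: candidate generation and correction propose plans, while the simulator alone decides which plans are physically valid.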