🤖 AI Summary
Problem: General-purpose cognitive robots are not robust to execution errors and generalize poorly across perceptual modalities and robot platforms. Method: We propose PhysicalAgent, an embodied manipulation framework that integrates iterative reasoning, diffusion-based video generation, and closed-loop execution. Specifically: (i) we introduce a generative reasoning mechanism that synthesizes candidate action videos directly from text instructions; (ii) we design a multi-round re-planning strategy with joint visual–motor modeling for closed-loop feedback control; and (iii) we support multi-view perception and deployment across heterogeneous robot platforms, both real and simulated. Results: In physical trials, first-attempt success is only 20–30%, but iterative correction raises overall success to 80% across platforms. The framework reaches up to 83% success on human-familiar tasks and outperforms state-of-the-art task-specific baselines. It offers a scalable, robust paradigm for embodied manipulation that does not depend on a particular perceptual modality or robot morphology.
📝 Abstract
We introduce PhysicalAgent, an agentic framework for robotic manipulation that integrates iterative reasoning, diffusion-based video generation, and closed-loop execution. Given a textual instruction, our method generates short video demonstrations of candidate trajectories, executes them on the robot, and iteratively re-plans in response to failures. This approach enables robust recovery from execution errors. We evaluate PhysicalAgent across multiple perceptual modalities (egocentric, third-person, and simulated) and robotic embodiments (bimanual UR3, Unitree G1 humanoid, simulated GR1), comparing against state-of-the-art task-specific baselines. Experiments demonstrate that our method consistently outperforms prior approaches, achieving up to 83% success on human-familiar tasks. Physical trials reveal that first-attempt success is limited (20–30%), yet iterative correction increases overall success to 80% across platforms. These results highlight the potential of video-based generative reasoning for general-purpose robotic manipulation and underscore the importance of iterative execution for recovering from initial failures. Our framework paves the way for scalable, adaptable, and robust robot control.
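The generate, execute, and re-plan cycle described in the abstract can be pictured as a simple retry loop. The Python sketch below is illustrative only and is not the authors' implementation: the names `run_closed_loop`, `Attempt`, `generate_plan`, `execute`, and `verify`, the string-valued plans and observations, and the cap of five attempts are all assumptions made for this example. In the actual framework the plan would be a generated video and the execution would run on a robot.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One generate-execute-verify round (hypothetical record type)."""
    index: int
    plan: str
    succeeded: bool
    feedback: Optional[str] = None


def run_closed_loop(
    instruction: str,
    generate_plan: Callable[[str, List[Attempt]], str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], Optional[str]],
    max_attempts: int = 5,  # assumed cap; the abstract does not state one
) -> List[Attempt]:
    """Re-plan after each failed attempt until success or the attempt cap."""
    history: List[Attempt] = []
    for index in range(1, max_attempts + 1):
        # The planner sees earlier failures so it can propose a corrected plan.
        plan = generate_plan(instruction, history)
        observation = execute(plan)
        # verify returns None on success, otherwise a failure description.
        failure = verify(instruction, observation)
        history.append(Attempt(index, plan, failure is None, failure))
        if failure is None:
            break
    return history


# Stand-in components for demonstration; all behavior here is made up.
def stub_generate_plan(instruction: str, history: List[Attempt]) -> str:
    return f"{instruction} / plan v{len(history) + 1}"


def stub_execute(plan: str) -> str:
    version = int(plan.rsplit("v", 1)[1])
    return "object placed" if version >= 3 else "grasp missed"


def stub_verify(instruction: str, observation: str) -> Optional[str]:
    return None if observation == "object placed" else observation


if __name__ == "__main__":
    attempts = run_closed_loop(
        "put the cup on the shelf", stub_generate_plan, stub_execute, stub_verify
    )
    for attempt in attempts:
        print(attempt.index, attempt.succeeded, attempt.feedback)
```

With these stubs the first two attempts fail and the third succeeds, which mirrors the pattern the abstract reports: low first-attempt success that is recovered through iterative correction. Passing the attempt history back to the planner is the design choice that makes the loop closed rather than a blind retry.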