PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
General-purpose cognitive robots suffer from poor robustness to execution errors and weak cross-modal and cross-platform generalization. Method: We propose an embodied manipulation framework integrating iterative reasoning, diffusion-based video generation, and closed-loop execution. Specifically: (i) we introduce a generative reasoning mechanism that synthesizes candidate action videos directly from text instructions; (ii) we design a multi-round re-planning strategy with joint visual–motor modeling for closed-loop feedback control; and (iii) we support multi-view perception and deployment across heterogeneous robot platforms (real-sim co-deployment). Results: Our framework achieves an initial success rate of 20–30%, improving to 80% overall after iterative correction—reaching up to 83% on human-familiar tasks—significantly outperforming state-of-the-art methods. It establishes a scalable, robust, and modality- and morphology-agnostic paradigm for autonomous embodied intelligence.

📝 Abstract
We introduce PhysicalAgent, an agentic framework for robotic manipulation that integrates iterative reasoning, diffusion-based video generation, and closed-loop execution. Given a textual instruction, our method generates short video demonstrations of candidate trajectories, executes them on the robot, and iteratively re-plans in response to failures. This approach enables robust recovery from execution errors. We evaluate PhysicalAgent across multiple perceptual modalities (egocentric, third-person, and simulated) and robotic embodiments (bimanual UR3, Unitree G1 humanoid, simulated GR1), comparing against state-of-the-art task-specific baselines. Experiments demonstrate that our method consistently outperforms prior approaches, achieving up to 83% success on human-familiar tasks. Physical trials reveal that first-attempt success is limited (20–30%), yet iterative correction increases overall success to 80% across platforms. These results highlight the potential of video-based generative reasoning for general-purpose robotic manipulation and underscore the importance of iterative execution for recovering from initial failures. Our framework paves the way for scalable, adaptable, and robust robot control.
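The closed-loop pipeline the abstract describes — generate a candidate action video from the instruction, execute it, check the outcome, and re-plan on failure — can be sketched as the following control flow. This is a minimal illustration, not the paper's implementation: the function names (`generate_video`, `execute_on_robot`, `verify_success`) and the fixed retry budget are hypothetical placeholders standing in for the diffusion-based generator, the robot controller, and the success verifier.

```python
from typing import Callable, Optional

def closed_loop_manipulate(
    instruction: str,
    generate_video: Callable[[str, Optional[str]], str],  # instruction + failure feedback -> candidate video
    execute_on_robot: Callable[[str], None],              # plays the candidate trajectory on the robot
    verify_success: Callable[[], bool],                   # inspects the world state after execution
    max_rounds: int = 5,
) -> bool:
    """Generate, execute, and re-plan until success or the round budget is exhausted."""
    feedback = None
    for round_idx in range(max_rounds):
        # Synthesize a candidate action video, conditioning on any failure feedback.
        candidate = generate_video(instruction, feedback)
        execute_on_robot(candidate)
        if verify_success():
            return True
        # Failure: feed the observed outcome back into the next planning round.
        feedback = f"attempt {round_idx + 1} failed"
    return False
```

Under this sketch, a first attempt that fails is not terminal: later rounds condition on the failure signal, which mirrors the reported jump from 20–30% first-attempt success to 80% overall after iterative correction.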
Problem

Research questions and friction points this paper is trying to address.

Develops general cognitive robotics with foundation world models
Enables robust recovery from robotic execution errors
Evaluates across multiple perceptual modalities and embodiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative reasoning with diffusion video generation
Closed-loop execution for error recovery
Multi-modal evaluation across robotic embodiments