PhysiAgent: An Embodied Agent Framework in Physical World

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current Vision-Language-Action (VLA) models generalize poorly, and pairing them serially with Vision-Language Models (VLMs), where VLMs handle only planning and VLAs handle only execution, leads to grounding failures and inefficient coordination. This paper proposes an embodied agent framework that enables dynamic, closed-loop interaction among task understanding, decision-making, and execution through VLM–VLA cooperation. Its core innovations are a monitor-memory-self-reflection mechanism for real-time inter-model feedback, capability assessment, and adaptive guidance, and an autonomous prompting framework coupled with a lightweight toolbox to enhance environmental grounding and tool orchestration. Evaluated on real-world robotic tasks, the framework significantly improves task success rates, demonstrating effectiveness in self-regulation, execution adaptability, and cross-modal coordination.

📝 Abstract

Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalization. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular solution. However, current approaches often combine these models in rigid, sequential structures: VLMs are used primarily for high-level scene understanding and task planning, and VLAs merely as executors of lower-level actions, which leads to ineffective collaboration and poor grounding. In this paper, we propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments. By incorporating monitor, memory, and self-reflection mechanisms together with lightweight off-the-shelf toolboxes, PhysiAgent offers an autonomous scaffolding framework that prompts VLMs to organize different components based on real-time proficiency feedback from VLAs, so as to maximally exploit the VLAs' capabilities. Experimental results demonstrate significant improvements in task-solving performance on complex real-world robotic tasks, showcasing effective self-regulation of VLMs, coherent tool collaboration, and adaptive evolution of the framework during execution. PhysiAgent makes practical and pioneering efforts to integrate VLMs and VLAs, effectively grounding embodied agent frameworks in real-world settings.

Problem

Research questions and friction points this paper is trying to address.

Enhancing collaboration between vision-language models and action models

Addressing limited generalization in embodied agent frameworks

Improving real-world task performance through adaptive model integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates monitor, memory, and self-reflection mechanisms

Uses real-time proficiency feedback to organize components

Enables adaptive evolution during task execution
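
The closed loop described above (execute, monitor, record in memory, reflect on proficiency, re-plan) can be illustrated with a small sketch. This is not PhysiAgent's implementation: every name here (Memory, run_episode, execute, monitor, simplify, the 0.5 threshold) is hypothetical, and the VLM and VLA are replaced by trivial stub functions so the example runs on its own. It only shows one plausible way proficiency feedback from an executor could drive a planner to decompose a subtask that the executor keeps failing.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical memory: a log of (subtask, success) outcomes from the monitor."""
    records: List[Tuple[str, bool]] = field(default_factory=list)

    def add(self, subtask: str, success: bool) -> None:
        self.records.append((subtask, success))

    def proficiency(self, subtask: str) -> float:
        """Fraction of logged attempts at `subtask` that succeeded (1.0 if untried)."""
        outcomes = [ok for name, ok in self.records if name == subtask]
        return sum(outcomes) / len(outcomes) if outcomes else 1.0


def run_episode(
    plan: List[str],
    execute: Callable[[str], str],        # stand-in for the VLA executor
    monitor: Callable[[str, str], bool],  # stand-in for a VLM-based monitor
    simplify: Callable[[str], List[str]], # stand-in for VLM re-planning
    memory: Memory,
    threshold: float = 0.5,               # assumed proficiency cut-off
    max_steps: int = 10,                  # hard bound so the loop always ends
) -> List[str]:
    """Run subtasks in order, re-planning when logged proficiency is low."""
    queue = list(plan)
    completed: List[str] = []
    steps = 0
    while queue and steps < max_steps:
        subtask = queue.pop(0)
        steps += 1
        observation = execute(subtask)
        success = monitor(subtask, observation)
        memory.add(subtask, success)
        if success:
            completed.append(subtask)
        elif memory.proficiency(subtask) < threshold:
            # Reflection step: the executor struggles with this subtask,
            # so ask the planner for simpler pieces and try those first.
            queue = simplify(subtask) + queue
        else:
            # Past record is still good enough: retry the same subtask.
            queue.insert(0, subtask)
    return completed


# Toy stubs: the "executor" cannot handle compound instructions.
def execute(subtask: str) -> str:
    return subtask  # the observation is just an echo of the instruction


def monitor(subtask: str, observation: str) -> bool:
    return " and " not in observation


def simplify(subtask: str) -> List[str]:
    return subtask.split(" and ")


memory = Memory()
result = run_episode(
    ["open drawer", "grasp cup and place cup"], execute, monitor, simplify, memory
)
print(result)
```

In this toy run the compound instruction fails once, its logged proficiency drops below the threshold, and the loop replaces it with two simpler subtasks. A real system would replace the stubs with model calls and would need a far richer notion of success and proficiency than a boolean log.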
