PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI

📅 2025-10-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the challenge of embodied agents accurately interpreting and executing complex natural-language instructions in real-world settings, this paper proposes a closed-loop feedback framework that fuses vision and language. It integrates speech recognition, multimodal instruction parsing, online task planning, and execution feedback evaluation, introducing an end-to-end embodied control paradigm that precisely maps high-level semantic instructions to physical actions. Compared with an LLM+CLIP baseline, the method achieves a 28% improvement in average task success rate across both simulated and real-world environments, generalizes markedly better to high-level, long-horizon, multi-step natural-language tasks, and executes more robustly. By ensuring interpretability, iterative refinement, and deployability, the framework establishes a novel pathway toward human-centered AI for embodied intelligence.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence of robots, particularly in natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened new pathways for realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulation agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots comprising a human-robot voice interaction module, a vision-language agent module, and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28% higher average task success rate in both simulated and real environments than approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.
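
The abstract describes a concrete architecture: a voice interaction module, a vision-language agent (vision-based task planner, instruction converter, feedback evaluator), and an action execution module, wired into a closed loop. The paper itself does not publish code, so the Python sketch below is only a hypothetical illustration of how those modules might compose; every class and method name and the `max_retries` bound are assumptions, with stubs standing in for real speech recognition, VLM queries, and robot control.

```python
# A minimal, self-contained sketch of the three-module loop described in the
# abstract. All names below are hypothetical stand-ins: the speech,
# vision-language, and robot calls are stubbed out for illustration.
from dataclasses import dataclass


@dataclass
class Feedback:
    """Result of the task-performance feedback evaluator."""
    success: bool
    reason: str = ""


class VoiceInterface:
    """Human-robot voice interaction module (stub: fixed utterance)."""

    def listen(self) -> str:
        # A real system would run speech recognition on microphone input.
        return "put the red block into the box"

    def speak(self, text: str) -> None:
        print(f"[robot says] {text}")


class VisionLanguageAgent:
    """Vision-language agent: task planner, instruction converter, evaluator."""

    def plan(self, instruction: str, hint: str = "") -> list[str]:
        # A real planner would prompt a VLM with the camera image, the
        # instruction, and any feedback hint. Stub: a fixed two-step plan.
        return ["pick(red_block)", "place(box)"]

    def to_commands(self, plan: list[str]) -> list[str]:
        # Instruction converter: map plan steps to executable primitives.
        return [f"EXEC {step}" for step in plan]

    def evaluate(self, instruction: str, log: list[str]) -> Feedback:
        # Feedback evaluator: did every executed step report success?
        ok = all(entry == "ok" for entry in log)
        return Feedback(success=ok, reason="" if ok else "a step failed")


class ActionExecutor:
    """Action execution module (stub: always reports success)."""

    def run(self, command: str) -> str:
        print(f"[executing] {command}")
        return "ok"


def run_agent(max_retries: int = 3) -> None:
    """Close the loop: plan, execute, evaluate, and replan on failure."""
    voice, agent, robot = VoiceInterface(), VisionLanguageAgent(), ActionExecutor()
    instruction = voice.listen()
    hint = ""
    for _ in range(max_retries):
        commands = agent.to_commands(agent.plan(instruction, hint))
        log = [robot.run(cmd) for cmd in commands]
        feedback = agent.evaluate(instruction, log)
        if feedback.success:
            voice.speak("Task completed.")
            return
        hint = feedback.reason  # feed the evaluator's verdict back to the planner
    voice.speak("I could not complete the task.")


if __name__ == "__main__":
    run_agent()
```

The design point worth noting is the feedback edge: the evaluator's verdict is routed back into the planner as a hint before the next attempt, which is the closed-loop behavior the summary credits for the robustness gains over LLM+CLIP-only approaches.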
Problem

Research questions and friction points this paper is trying to address.

LLM-based agents lack online planning for complex tasks
Embodied agents need more reliable natural-language instruction execution
Robots require better human-centered AI interaction capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based embodied agent with voice interaction
Vision-language module for task planning and feedback
Integrated framework for natural language instruction execution
👥 Authors
Wenbin Ding, Jun Chen, Mingjia Chen, Fei Xie, Qi Mao
School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing, Jiangsu 210023, China
Philip Dames
Temple University
Robotics