🤖 AI Summary
This work addresses two limitations of current large language model (LLM) approaches to robotic manipulation: the lack of modular, agentic execution mechanisms, and the absence of benchmarks that systematically evaluate multi-step reasoning and linguistic variation. To this end, we propose the ALRM framework, which integrates a ReAct-style reasoning loop and enables language-driven closed-loop planning and interpretable control through two modes: Code-as-Policy (CaP) and Tool-as-Policy (TaP). We introduce the first modular agent architecture capable of reflection and correction, alongside a novel simulation benchmark of 56 tasks across multiple environments with linguistically diverse instructions. Experiments with ten LLMs indicate that ALRM substantially improves multi-step manipulation performance, with Claude-4.1-Opus (closed-source) and Falcon-H1-7B (open-source) achieving the best results under CaP.
📝 Abstract
Large Language Models (LLMs) have recently empowered agentic frameworks to exhibit advanced reasoning and planning capabilities. However, their integration into robotic control pipelines remains limited in two respects: (1) prior LLM-based approaches often lack modular, agentic execution mechanisms, limiting their ability to plan, reflect on outcomes, and revise actions in a closed-loop manner; and (2) existing benchmarks for manipulation tasks focus on low-level control and do not systematically evaluate multi-step reasoning and linguistic variation. In this paper, we propose Agentic LLM for Robot Manipulation (ALRM), an LLM-driven agentic framework for robotic manipulation. ALRM integrates policy generation with agentic execution through a ReAct-style reasoning loop, supporting two complementary modes: Code-as-Policy (CaP) for direct executable control code generation, and Tool-as-Policy (TaP) for iterative planning and tool-based action execution. To enable systematic evaluation, we also introduce a novel simulation benchmark comprising 56 tasks across multiple environments, capturing linguistically diverse instructions. Experiments with ten LLMs demonstrate that ALRM provides a scalable, interpretable, and modular approach for bridging natural language reasoning with reliable robotic execution. Results reveal Claude-4.1-Opus as the top closed-source model and Falcon-H1-7B as the top open-source model under CaP.
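To make the closed-loop idea concrete, the following is a minimal sketch of a ReAct-style loop in the spirit of the TaP mode: the agent proposes a thought and a tool call, observes the result, and revises its next action when a step fails. It is an illustration under stated assumptions, not the ALRM implementation. `ToyWorld`, `scripted_policy`, and `react_loop` are hypothetical names, the "LLM" is replaced by a hand-written rule so the example runs offline, and the two-tool interface (`pick`, `place`) is assumed rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical illustrations, not the ALRM API.
Step = Tuple[str, str, str]  # (thought, tool name, observation)


@dataclass
class ToyWorld:
    """Tiny stand-in for a simulator: one block and one gripper."""
    block_at: str = "table"
    holding: bool = False

    def pick(self, obj: str) -> str:
        if self.holding:
            return "error: gripper is already full"
        self.holding = True
        self.block_at = "gripper"
        return f"ok: picked {obj}"

    def place(self, target: str) -> str:
        if not self.holding:
            return "error: nothing in gripper"
        self.holding = False
        self.block_at = target
        return f"ok: placed on {target}"


def scripted_policy(history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, tool name, argument).

    It deliberately starts with a wrong action so the loop has a
    failure to reflect on and correct.
    """
    if not history:
        return ("Put the block on the shelf.", "place", "shelf")
    _, last_tool, last_observation = history[-1]
    if last_observation.startswith("error"):
        return ("Placing failed; pick the block first.", "pick", "block")
    if last_tool == "pick":
        return ("Block is in the gripper; place it.", "place", "shelf")
    return ("Task complete.", "finish", "")


def react_loop(
    policy: Callable[[List[Step]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate reasoning and acting until the policy says 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, arg = policy(history)
        if tool == "finish":
            break
        observation = tools[tool](arg)  # act, then feed the result back
        history.append((thought, tool, observation))
    return history


world = ToyWorld()
trace = react_loop(scripted_policy, {"pick": world.pick, "place": world.place})
for thought, tool, observation in trace:
    print(f"{thought} -> {tool}: {observation}")
```

In this sketch the first `place` call fails, the observation is fed back into the policy, and the next two steps recover by picking and then placing the block. A CaP-style variant would instead have the model emit one complete control program and execute it, with the execution result serving as the observation.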