🤖 AI Summary
This work addresses the unreliability of large language models (LLMs) in executing structured workflows specified through natural language. To overcome this limitation, the authors propose RunAgent, a multi-agent platform that combines the expressiveness of natural language with the determinism of programmatic execution through an agentic language. RunAgent enforces constraint-guided stepwise execution, supported by explicit control constructs, dynamic selection of the reasoning strategy, and context filtering. The framework automatically derives and validates constraints at each step, and chooses among LLM-based reasoning, tool invocation, and code generation and execution (e.g., in Python). Evaluated on the NaturalPlan and SciBench benchmarks, RunAgent outperforms both baseline LLMs and the state-of-the-art PlanGEN methods.
📝 Abstract
Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., \texttt{IF}, \texttt{GOTO}, \texttt{FORALL}). Beyond syntactic and semantic verification of each step's output, which is performed against the specific instruction of that step, RunAgent autonomously derives and validates constraints from the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error-correction mechanisms to ensure correctness. Finally, RunAgent filters the context history, retaining only the information relevant to the step being executed. Evaluations on the NaturalPlan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
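To make the idea of constraint-checked stepwise execution with explicit control flow concrete, the following is a minimal Python sketch. It is not RunAgent's implementation: the names `Step`, `run_plan`, and `goto_if` are hypothetical, plain Python callables stand in for the LLM, tool, or generated-code execution of a step, and a hand-written predicate stands in for the constraints that RunAgent is described as deriving automatically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, int]


@dataclass
class Step:
    """One plan step (hypothetical structure, for illustration only)."""
    label: str
    action: Callable[[State], None]       # stands in for LLM/tool/code execution
    constraint: Callable[[State], bool]   # must hold after the step has run
    # Optional GOTO-style branch: returns a label to jump to, or None to fall through.
    goto_if: Optional[Callable[[State], Optional[str]]] = None


def run_plan(steps: List[Step], state: State, max_steps: int = 100) -> State:
    """Execute steps in order, validating a constraint after every step."""
    index = {step.label: i for i, step in enumerate(steps)}
    pc = 0          # position of the step to execute next
    executed = 0    # guards against unbounded GOTO loops
    while pc < len(steps):
        if executed >= max_steps:
            raise RuntimeError("step budget exceeded")
        step = steps[pc]
        step.action(state)
        if not step.constraint(state):
            raise ValueError(f"constraint violated after step {step.label!r}")
        executed += 1
        target = step.goto_if(state) if step.goto_if else None
        pc = index[target] if target is not None else pc + 1
    return state


# Toy plan: sum the integers 1..3 using an IF/GOTO-style loop.
plan = [
    Step("init",
         lambda s: s.update(total=0, i=1),
         lambda s: s["total"] == 0),
    Step("add",
         lambda s: s.update(total=s["total"] + s["i"], i=s["i"] + 1),
         lambda s: s["total"] >= 0,
         goto_if=lambda s: "add" if s["i"] <= 3 else None),
]

result = run_plan(plan, {})
```

In this sketch a violated constraint simply raises an error; a system with error correction, as the abstract describes, would presumably retry or repair the step instead.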