🤖 AI Summary
This work addresses error accumulation in open-source large language models during multi-turn tool-use tasks, where smaller parameter counts, limited context windows, and constrained inference budgets make cascading failures more likely. To mitigate this, the authors propose a failure-aware, two-stage meta-agent framework: it first analyzes failure trajectories of a baseline agent to identify the most prevalent error patterns, then uses an orchestration mechanism to activate a minimal set of specialized agents that inject targeted context for the tool-use agent before each decision-making step. Because only the agents tied to prevalent failures are activated, the intervention stays narrow rather than adding context indiscriminately. Experiments across open-source LLMs show performance gains of up to 27% over standard baselines across evaluation modes, supporting targeted context curation as a design principle for reliable conversational tool-use agents.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet in conversational benchmarks, which simulate real-world, customer-centric issue-resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decisions. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, all of which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains of up to 27% over standard baselines across evaluation modes. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.
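The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the failure labels, the specialist functions, the top-k selection rule, and the names `top_failure_modes` and `build_context` are all assumptions made for illustration. It only shows the general shape of counting failure types offline and then running just the matching specialists to produce context before a decision step.

```python
from collections import Counter
from typing import Callable

# A specialist takes the conversation so far and returns a short context note.
# Hypothetical type; the abstract does not specify the agents' interface.
Specialist = Callable[[str], str]


def top_failure_modes(failed_trajectories: list[list[str]], k: int) -> list[str]:
    """Stage 1 (assumed form): count error labels over failed runs, keep the k most common."""
    counts = Counter(label for traj in failed_trajectories for label in traj)
    return [label for label, _ in counts.most_common(k)]


def build_context(conversation: str, active: list[str],
                  specialists: dict[str, Specialist]) -> str:
    """Stage 2 (assumed form): run only specialists tied to prevalent failures, join their notes."""
    notes = [specialists[name](conversation) for name in active if name in specialists]
    return "\n".join(notes)


# Illustrative data: each failed trajectory is annotated with hypothetical error labels.
failed = [
    ["wrong_tool_args", "policy_violation"],
    ["wrong_tool_args", "policy_violation"],
    ["wrong_tool_args", "premature_stop"],
]

# Hypothetical specialists, one per failure label.
specialists: dict[str, Specialist] = {
    "wrong_tool_args": lambda conv: "Check the tool argument schema before calling.",
    "policy_violation": lambda conv: "Re-read the refund policy before acting.",
    "premature_stop": lambda conv: "Confirm the user's issue is resolved before ending.",
}

active = top_failure_modes(failed, k=2)
context = build_context("user: please cancel my order", active, specialists)
```

In this sketch the resulting `context` string would be prepended to the tool-use agent's prompt before its next decision; the specialist for `premature_stop` is never invoked because that label falls outside the top two.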