🤖 AI Summary
Large language model agents are often unreliable in long-horizon tasks because of policy violations, tool hallucination, and deviation from user intent. This work proposes NOD, a heterogeneous multi-agent architecture comprising Navigator, Operator, and Director roles. NOD explicitly models a structured global state to track task progress and introduces an independent Director agent that validates critical actions before they execute and intervenes when necessary, thereby limiting error propagation. Evaluated on τ²-Bench, NOD achieves higher overall task success rates and higher critical-action precision than baselines, while reducing policy violations, tool hallucination, and misalignment with user intent.
📝 Abstract
Large language model (LLM) agents increasingly power service applications, such as booking flight tickets. However, these service agents are unreliable in long-horizon tasks: they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in the dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. In addition, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As a result, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on $\tau^2$-Bench demonstrate that NOD achieves higher task success rates and critical action precision than baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
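To make the control flow described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the Navigator picks the next action from the externalized state, that the Operator executes tool calls, and that the Director is consulted only for actions flagged as critical. All names (`GlobalState`, `Action`, `navigator`, `operator`, `director`, `run`, the two toy tools) and the rule-based logic standing in for LLM calls are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

Tool = Callable[[Dict[str, str]], str]


@dataclass
class Action:
    tool: str
    args: Dict[str, str]
    critical: bool = False  # assumption: critical actions are flagged explicitly


@dataclass
class GlobalState:
    """Hypothetical structured task state kept outside the dialogue context."""
    user_intent: str
    confirmed_by_user: bool = False
    completed: List[str] = field(default_factory=list)
    blocked: List[str] = field(default_factory=list)


def navigator(state: GlobalState) -> Optional[Action]:
    """Choose the next action from the explicit state (toy stand-in for an LLM)."""
    if "lookup_booking" not in state.completed:
        return Action("lookup_booking", {"intent": state.user_intent})
    pending = "cancel_booking"
    if pending not in state.completed and pending not in state.blocked:
        return Action(pending, {"intent": state.user_intent}, critical=True)
    return None  # nothing left to do


def director(action: Action, state: GlobalState, known_tools: Set[str]) -> bool:
    """Independent check run before a critical action (toy policy)."""
    if action.tool not in known_tools:
        return False  # reject a hallucinated tool name
    return state.confirmed_by_user  # assumed policy: user must have confirmed


def operator(action: Action, tools: Dict[str, Tool]) -> str:
    """Execute the approved tool call."""
    return tools[action.tool](action.args)


def run(state: GlobalState, tools: Dict[str, Tool], max_steps: int = 10) -> List[str]:
    log: List[str] = []
    for _ in range(max_steps):
        action = navigator(state)
        if action is None:
            break
        # Selective oversight: only critical actions go through the Director.
        if action.critical and not director(action, state, set(tools)):
            state.blocked.append(action.tool)
            log.append(f"blocked:{action.tool}")
            continue
        operator(action, tools)
        state.completed.append(action.tool)
        log.append(f"done:{action.tool}")
    return log


TOOLS: Dict[str, Tool] = {
    "lookup_booking": lambda args: "found",
    "cancel_booking": lambda args: "cancelled",
}
```

In this sketch, a run without user confirmation performs the lookup but has the cancellation blocked by the Director, whereas a run with `confirmed_by_user=True` completes both steps. Keeping progress in `GlobalState` rather than in the conversation history is what lets the Navigator avoid retrying the blocked action.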