🤖 AI Summary
Current evaluations of LLM agents for multi-turn tool use rely on expensive, deterministic backends that are difficult to build and iterate on, hindering reliable model comparison and training-data generation. This work proposes the first proxy-state-based evaluation framework: an LLM-powered state tracker infers a structured agent state from interaction trajectories, and an LLM judge verifies task completion and detects hallucinations against scenario-specific constraints, enabling high-fidelity assessment without a deterministic backend. The approach supports scalable evaluation, user-persona sensitivity analysis, and supervision from both on- and off-policy rollouts. Experiments show that the benchmark consistently distinguishes model families and reasoning intensities, that automatically generated supervision signals generalize to unseen scenarios, that simulator hallucination rates are near zero, and that agreement with human judgments exceeds 90%.
📝 Abstract
Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly deployed in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate on. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final-state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, the expected final state, and the expected agent behavior; an LLM state tracker infers a structured proxy state from the full interaction trace, and LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across model families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Ablation studies show that careful scenario specification yields near-zero simulator hallucination rates. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
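The evaluation loop the abstract describes (scenario spec → proxy-state inference → judging) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the `Scenario` fields mirror the abstract's description, but `track_proxy_state` and `judge` stand in for the LLM state tracker and LLM judge with trivial deterministic logic, and all names (`cancel_order`, the trace format) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Fields per the abstract: goal, facts, expected final state, expected behavior.
    user_goal: str
    facts: dict
    expected_final_state: dict
    expected_behavior: str

def track_proxy_state(trace: list[dict]) -> dict:
    """Stand-in for the LLM state tracker: fold each tool call's
    arguments into a flat proxy state keyed by tool name.
    (The real tracker would prompt an LLM over the full trace.)"""
    state: dict = {}
    for turn in trace:
        if turn.get("role") == "tool_call":
            state[turn["name"]] = turn["args"]
    return state

def judge(scenario: Scenario, proxy_state: dict) -> bool:
    """Stand-in for the LLM judge: the task counts as completed iff
    every expected key/value appears in the inferred proxy state."""
    return all(proxy_state.get(k) == v
               for k, v in scenario.expected_final_state.items())

# Hypothetical scenario and interaction trace.
scenario = Scenario(
    user_goal="cancel order 123",
    facts={"order_id": "123"},
    expected_final_state={"cancel_order": {"order_id": "123"}},
    expected_behavior="confirm with the user before cancelling",
)
trace = [
    {"role": "user", "content": "Please cancel order 123."},
    {"role": "tool_call", "name": "cancel_order", "args": {"order_id": "123"}},
]
print(judge(scenario, track_proxy_state(trace)))  # True
```

In the actual framework both stand-ins would be LLM calls, and a second judge pass would check the trace for tool/user hallucinations against the scenario's constraints; the point here is only the data flow: no backend database is touched, only the inferred proxy state.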