🤖 AI Summary
Existing AI agent evaluation relies predominantly on binary, final-state correctness judgments, neglecting critical dimensions such as safety, execution efficiency, and intermediate-step validity.
Method: We propose a task modeling paradigm grounded in deterministic finite automata (DFAs), enabling structured representation of task specifications. Building on this, we introduce a five-dimensional fine-grained evaluation framework assessing path correctness, ordering consistency, prefix criticality, hazardous API call rate, and execution efficiency. Our methodology integrates Kendall rank correlation, exact path-matching algorithms, and risk-aware API call detection to enable quantifiable, interpretable analysis of the entire function-call sequence.
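The two core ingredients described above can be sketched in a few lines: a task DFA over tool names whose accepting runs define valid call paths, and a Kendall rank correlation between the expected call ordering and the positions of those calls in the observed trace. This is an illustrative sketch, not the paper's implementation; the task, tool names, and state labels are hypothetical.

```python
# Hedged sketch: DFA-based path validity and Kendall-tau ordering
# consistency for function-call traces. The "search -> book -> pay"
# task and all identifiers are made up for illustration.
from itertools import combinations

# Transition function of a hypothetical task DFA: (state, tool) -> state.
DFA = {
    ("s0", "search"): "s1",
    ("s1", "book"): "s2",
    ("s2", "pay"): "s3",
}
ACCEPT = {"s3"}  # accepting states = task completed via a valid path


def path_accepted(calls, start="s0"):
    """Return True iff the call sequence drives the DFA to an accepting state."""
    state = start
    for call in calls:
        key = (state, call)
        if key not in DFA:
            return False  # invalid transition: path rejected immediately
        state = DFA[key]
    return state in ACCEPT


def kendall_tau(expected, observed):
    """Kendall rank correlation between an expected call ordering and the
    positions at which those calls occur in the observed trace."""
    pos = {c: i for i, c in enumerate(observed)}
    ranked = [c for c in expected if c in pos]  # ignore calls never made
    concordant = discordant = 0
    for a, b in combinations(ranked, 2):  # a precedes b in `expected`
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    n = concordant + discordant
    return (concordant - discordant) / n if n else 1.0
```

For example, the trace `["book", "search", "pay"]` is rejected by the DFA (no `book` transition from the start state), and its Kendall tau against the expected ordering is 1/3, since one of the three call pairs is out of order.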
Contribution/Results: Experiments across diverse environments demonstrate that our framework distinguishes agents that appear equivalent under conventional final-state metrics, improving both assessment accuracy and discriminative power. This work establishes a principled paradigm for evaluating the reliability of large language model-based agents.
📝 Abstract
Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
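Two of the remaining CORE-style quantities lend themselves to equally compact sketches. The definitions below are assumptions for illustration (the paper's exact formulas may differ): Harmful-Call Rate as the fraction of calls drawn from a risk-annotated API set, and Efficiency as the ratio of the shortest valid path length to the observed trace length.

```python
# Hedged sketch of Harmful-Call Rate and Efficiency under assumed
# definitions; the hazardous-API set and formulas are hypothetical.
HAZARDOUS = {"delete_account", "transfer_funds"}  # hypothetical risk-annotated APIs


def harmful_call_rate(trace):
    """Fraction of calls in the trace that hit a risk-annotated API."""
    if not trace:
        return 0.0
    return sum(call in HAZARDOUS for call in trace) / len(trace)


def efficiency(trace, shortest_valid_len):
    """Length of the shortest valid path relative to the observed trace,
    capped at 1.0 so shorter-than-optimal (invalid) traces don't exceed it."""
    if not trace:
        return 0.0
    return min(1.0, shortest_valid_len / len(trace))
```

For instance, a four-call trace solving a task whose shortest valid path has three calls scores an efficiency of 0.75, and a trace containing one hazardous call out of two has a harmful-call rate of 0.5.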