🤖 AI Summary
Existing AI agent evaluation relies predominantly on binary, final-state correctness judgments, neglecting critical dimensions such as safety, execution efficiency, and intermediate-step validity.
Method: We propose a task modeling paradigm grounded in deterministic finite automata (DFAs), enabling structured representation of task specifications. Building on this, we introduce a five-dimensional fine-grained evaluation framework assessing path correctness, ordering consistency, prefix criticality, hazardous API call rate, and execution efficiency. Our methodology integrates Kendall rank correlation, exact path-matching algorithms, and risk-aware API call detection to enable quantifiable, interpretable analysis of the entire function-call sequence.
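The two core ingredients described above can be sketched in a few lines: a task DFA over tool names whose accepting runs define valid call paths, and a Kendall rank correlation between the expected call ordering and the positions of those calls in the observed trace. This is an illustrative sketch, not the paper's implementation; the task, tool names, and state labels are hypothetical.

```python
# Hedged sketch: DFA-based path validity and Kendall-tau ordering
# consistency for function-call traces. The "search -> book -> pay"
# task and all identifiers are made up for illustration.
from itertools import combinations

# Transition function of a hypothetical task DFA: (state, tool) -> state.
DFA = {
    ("s0", "search"): "s1",
    ("s1", "book"): "s2",
    ("s2", "pay"): "s3",
}
ACCEPT = {"s3"}  # accepting states = task completed via a valid path


def path_accepted(calls, start="s0"):
    """Return True iff the call sequence drives the DFA to an accepting state."""
    state = start
    for call in calls:
        key = (state, call)
        if key not in DFA:
            return False  # invalid transition: path rejected immediately
        state = DFA[key]
    return state in ACCEPT


def kendall_tau(expected, observed):
    """Kendall rank correlation between an expected call ordering and the
    positions at which those calls occur in the observed trace."""
    pos = {c: i for i, c in enumerate(observed)}
    ranked = [c for c in expected if c in pos]  # ignore calls never made
    concordant = discordant = 0
    for a, b in combinations(ranked, 2):  # a precedes b in `expected`
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    n = concordant + discordant
    return (concordant - discordant) / n if n else 1.0
```

For example, the trace `["book", "search", "pay"]` is rejected by the DFA (no `book` transition from the start state), and its Kendall tau against the expected ordering is 1/3, since one of the three call pairs is out of order.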
Contribution/Results: Experiments across diverse environments demonstrate that our framework distinguishes agents that appear equivalent under conventional final-state metrics, improving both assessment accuracy and discriminative power. This work establishes a principled paradigm for evaluating the reliability of large language model-based agents.
📝 Abstract
Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
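Two of the remaining CORE-style quantities lend themselves to equally compact sketches. The definitions below are assumptions for illustration (the paper's exact formulas may differ): Harmful-Call Rate as the fraction of calls drawn from a risk-annotated API set, and Efficiency as the ratio of the shortest valid path length to the observed trace length.

```python
# Hedged sketch of Harmful-Call Rate and Efficiency under assumed
# definitions; the hazardous-API set and formulas are hypothetical.
HAZARDOUS = {"delete_account", "transfer_funds"}  # hypothetical risk-annotated APIs


def harmful_call_rate(trace):
    """Fraction of calls in the trace that hit a risk-annotated API."""
    if not trace:
        return 0.0
    return sum(call in HAZARDOUS for call in trace) / len(trace)


def efficiency(trace, shortest_valid_len):
    """Length of the shortest valid path relative to the observed trace,
    capped at 1.0 so shorter-than-optimal (invalid) traces don't exceed it."""
    if not trace:
        return 0.0
    return min(1.0, shortest_valid_len / len(trace))
```

For instance, a four-call trace solving a task whose shortest valid path has three calls scores an efficiency of 0.75, and a trace containing one hazardous call out of two has a harmful-call rate of 0.5.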