CORE: Full-Path Evaluation of LLM Agents Beyond Final State

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI agent evaluation relies predominantly on binary, final-state correctness judgments, neglecting critical dimensions such as safety, execution efficiency, and intermediate-step validity. Method: We propose a task modeling paradigm grounded in deterministic finite automata (DFA), enabling structured representation of task specifications. Building upon this, we introduce a five-dimensional fine-grained evaluation framework assessing path correctness, ordering consistency, prefix criticality, hazardous API call rate, and execution efficiency. Our methodology integrates Kendall rank correlation, exact path-matching algorithms, and risk-aware API call detection to enable quantifiable, interpretable analysis of the entire function-call sequence. Contribution/Results: Experiments across diverse environments demonstrate that our framework effectively discriminates between agents exhibiting comparable performance under conventional metrics—thereby significantly improving assessment accuracy and discriminative power. This work establishes a novel, principled paradigm for evaluating the reliability of large language model–based agents.
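The exact path-matching idea can be sketched in a few lines: encode the task as a DFA over API calls and check whether the agent's full call sequence is a valid path. This is a minimal illustration, not the paper's implementation; the "book a flight" task, state names, and API names are hypothetical.

```python
# DFA for a hypothetical "book a flight" task: state -> {api_call: next_state}.
# All names here are illustrative assumptions, not taken from the paper.
transitions = {
    "start":    {"search_flights": "searched"},
    "searched": {"select_flight": "selected"},
    "selected": {"pay": "done"},
}
ACCEPT = {"done"}

def path_correct(calls, transitions, accept, start="start"):
    """Return True iff the call sequence traces a valid path to an accepting state."""
    state = start
    for call in calls:
        nxt = transitions.get(state, {}).get(call)
        if nxt is None:
            return False  # invalid intermediate step, rejected mid-path
        state = nxt
    return state in accept

print(path_correct(["search_flights", "select_flight", "pay"], transitions, ACCEPT))  # True
print(path_correct(["pay"], transitions, ACCEPT))                                     # False
```

Unlike a final-state check, this rejects sequences whose intermediate steps are invalid even if they happen to end in the right state.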

📝 Abstract
Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents using function-call sequences beyond final state
Overcoming limitations of binary final-state evaluation in agent benchmarks
Assessing safety, efficiency and intermediate correctness in agent behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

DFA-based framework for task encoding
Five-metric suite for path evaluation
Multi-dimensional assessment beyond final state
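The ordering-consistency component of the metric suite can be sketched with a plain Kendall's tau over the expected versus observed call order. This is our own minimal sketch, assuming both sequences contain the same calls exactly once; the helper name is hypothetical and the paper's composite metric may weight or combine this differently.

```python
from itertools import combinations

def kendall_tau(expected, observed):
    """Kendall's tau between two orderings of the same call set (sketch).

    Assumes `expected` and `observed` hold the same calls, each exactly once.
    Returns 1.0 for identical order, -1.0 for fully reversed order.
    """
    rank = {call: i for i, call in enumerate(observed)}
    concordant = discordant = 0
    for a, b in combinations(expected, 2):  # a precedes b in the expected order
        if rank[a] < rank[b]:
            concordant += 1
        else:
            discordant += 1
    n = len(expected)
    return (concordant - discordant) / (n * (n - 1) / 2)

# One swapped pair out of six: tau = (5 - 1) / 6
print(kendall_tau(["a", "b", "c", "d"], ["a", "c", "b", "d"]))  # 0.666...
```

A rank correlation like this grades partial ordering agreement, so two agents that both reach the goal state can still be separated by how faithfully they followed the expected call order.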
Panagiotis Michelakis
Synkrasis Labs, Athens, Greece
Yiannis Hadjiyiannis
Synkrasis Labs, Athens, Greece
Dimitrios Stamoulis
Harbin Institute of Technology
Agentic AI · Geospatial AI · Computer Vision · Hardware systems