🤖 AI Summary
This work addresses the inefficiency and poor reproducibility of large language model (LLM) evaluation, which often relies on labor-intensive manual processes such as benchmark selection, code reproduction, and metric interpretation. To overcome these limitations, we propose the first agent-driven automated evaluation system, which translates natural-language evaluation requests into end-to-end executable, traceable, and customizable evaluation workflows. The system leverages NL2Bench for intent parsing and benchmark planning and BenchResolve for standardized data acquisition, and integrates task-aware metric selection with decision-oriented report generation. Human-in-the-loop checkpoints and a sample evidence-chain mechanism enhance transparency, controllability, and reproducibility. In industrial settings, the framework enables efficient execution of diverse evaluation tasks with minimal human intervention.
📝 Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
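The three-stage workflow the abstract describes (intent structuring, benchmark resolution, and reporting) can be sketched as a minimal pipeline. This is an illustrative toy, not One-Eval's actual implementation: the keyword-to-benchmark and benchmark-to-metric mappings below are hypothetical placeholders standing in for NL2Bench and BenchResolve.

```python
from dataclasses import dataclass, field

@dataclass
class EvalPlan:
    """Structured evaluation intent derived from a natural-language request."""
    request: str
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)

def nl2bench(request: str) -> EvalPlan:
    """Toy stand-in for NL2Bench: map request keywords to candidate benchmarks."""
    keyword_map = {"math": "GSM8K", "code": "HumanEval", "knowledge": "MMLU"}
    benchmarks = [b for kw, b in keyword_map.items() if kw in request.lower()]
    return EvalPlan(request=request, benchmarks=benchmarks or ["MMLU"])

def bench_resolve(plan: EvalPlan) -> EvalPlan:
    """Toy stand-in for BenchResolve: attach a task-aware metric per benchmark."""
    metric_map = {"GSM8K": "exact_match", "HumanEval": "pass@1", "MMLU": "accuracy"}
    plan.metrics = [metric_map[b] for b in plan.benchmarks]
    return plan

def render_report(plan: EvalPlan, scores: dict[str, float]) -> str:
    """Toy report stage: pair each benchmark with its metric and score."""
    lines = [f"{b}: {m} = {scores[b]:.3f}"
             for b, m in zip(plan.benchmarks, plan.metrics)]
    return "\n".join(lines)

plan = bench_resolve(nl2bench("Evaluate math reasoning of my model"))
print(render_report(plan, {"GSM8K": 0.812}))  # → GSM8K: exact_match = 0.812
```

The actual system additionally inserts human-in-the-loop checkpoints between stages and records per-sample evidence trails, which this sketch omits for brevity.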