🤖 AI Summary
This work addresses the inefficiency and poor reproducibility of large language model (LLM) evaluation, which often relies on labor-intensive manual processes such as benchmark selection, code reproduction, and metric interpretation. To overcome these limitations, we propose the first agent-driven automated evaluation system, which translates natural-language evaluation requests into end-to-end executable, traceable, and customizable evaluation workflows. The system leverages NL2Bench for intent parsing and benchmark planning and BenchResolve for standardized data acquisition, and integrates task-aware metric selection with decision-oriented report generation. Human-in-the-loop checkpoints and a sample evidence-chain mechanism enhance transparency, controllability, and reproducibility. In industrial settings, the framework enables efficient execution of diverse evaluation tasks with minimal human intervention.
📝 Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
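The three-stage workflow the abstract describes (intent structuring, benchmark resolution, and reporting) can be sketched as a minimal pipeline. This is an illustrative toy, not One-Eval's actual implementation: the keyword-to-benchmark and benchmark-to-metric mappings below are hypothetical placeholders standing in for NL2Bench and BenchResolve.

```python
from dataclasses import dataclass, field

@dataclass
class EvalPlan:
    """Structured evaluation intent derived from a natural-language request."""
    request: str
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)

def nl2bench(request: str) -> EvalPlan:
    """Toy stand-in for NL2Bench: map request keywords to candidate benchmarks."""
    keyword_map = {"math": "GSM8K", "code": "HumanEval", "knowledge": "MMLU"}
    benchmarks = [b for kw, b in keyword_map.items() if kw in request.lower()]
    return EvalPlan(request=request, benchmarks=benchmarks or ["MMLU"])

def bench_resolve(plan: EvalPlan) -> EvalPlan:
    """Toy stand-in for BenchResolve: attach a task-aware metric per benchmark."""
    metric_map = {"GSM8K": "exact_match", "HumanEval": "pass@1", "MMLU": "accuracy"}
    plan.metrics = [metric_map[b] for b in plan.benchmarks]
    return plan

def render_report(plan: EvalPlan, scores: dict[str, float]) -> str:
    """Toy report stage: pair each benchmark with its metric and score."""
    lines = [f"{b}: {m} = {scores[b]:.3f}"
             for b, m in zip(plan.benchmarks, plan.metrics)]
    return "\n".join(lines)

plan = bench_resolve(nl2bench("Evaluate math reasoning of my model"))
print(render_report(plan, {"GSM8K": 0.812}))  # → GSM8K: exact_match = 0.812
```

The actual system additionally inserts human-in-the-loop checkpoints between stages and records per-sample evidence trails, which this sketch omits for brevity.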