MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

📅 2026-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a reference-free, multi-independent evaluator framework for large language model (LLM) assessment, addressing the reliance on ground-truth labels or inter-evaluator coordination in existing approaches. By leveraging human-aligned scoring criteria to guide multiple independently prompted LLM evaluators, the method supports both discrete and continuous joint scoring, making it adaptable across diverse task settings. Notably, it eliminates the need for reference texts and coordination among evaluators, offering enhanced flexibility, interpretability, and scalability. Experimental results demonstrate that the proposed approach achieves high agreement with human judgments across multiple tasks, significantly outperforming current methods while substantially reducing computational overhead, thereby enabling efficient and robust LLM evaluation.

📝 Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
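The core idea in the abstract, several evaluators prompted independently against the same human-aligned rubric, with their scores aggregated afterwards, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rubric criteria, the `aggregate` helper, and the stubbed evaluator outputs are all assumptions made for the example; in practice each evaluation would be a separate, uncoordinated LLM call.

```python
from statistics import mean, median

# Illustrative rubric criteria (hypothetical; the paper's human-aligned
# schema is task-specific).
RUBRIC = ("fluency", "coherence", "relevance")

def aggregate(evaluations, method="mean"):
    """Combine per-criterion scores from independent evaluators.

    evaluations: list of dicts mapping criterion -> score. Scores may be
    discrete (e.g. 1-5) or continuous (e.g. 0.0-1.0), mirroring the
    framework's support for both scoring modes.
    """
    agg = mean if method == "mean" else median
    return {c: agg(e[c] for e in evaluations) for c in RUBRIC}

# Stubbed outputs of three independently prompted evaluators. There is no
# shared context or debate between them -- each scores in isolation.
# (Numbers invented for illustration.)
evals = [
    {"fluency": 4, "coherence": 5, "relevance": 4},
    {"fluency": 5, "coherence": 4, "relevance": 4},
    {"fluency": 4, "coherence": 4, "relevance": 5},
]

scores = aggregate(evals)
```

Because the evaluators never see one another's outputs, adding or removing an evaluator only changes the aggregation step, which is what makes the approach scalable and coordination-free.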
Problem

Research questions and friction points this paper is trying to address.

reference-free evaluation
Large Language Models
human-aligned evaluation
automatic assessment
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free evaluation
human-aligned assessment
ensemble of independent evaluators
large language model evaluation
scalable LLM benchmarking
Nalin Srun
Université de Lorraine, CNRS, LORIA, F-54000 Nancy, France
Parisa Rastin
LORIA, École des Mines
Artificial Intelligence
Guénaël Cabanes
LORIA, Université de Lorraine
Artificial Intelligence, Complex Systems, Animal Behaviour
Lydia Boudjeloud-Assala
Université de Lorraine, CNRS, LORIA, F-57000 Metz, France