🤖 AI Summary
This work proposes a reference-free, multi-independent-evaluator framework for large language model (LLM) assessment, addressing existing approaches' reliance on ground-truth labels or inter-evaluator coordination. By using human-aligned scoring criteria to guide multiple independently prompted LLM evaluators, the method supports both discrete and continuous joint scoring, making it adaptable across diverse task settings. Notably, it eliminates the need for reference texts and for coordination among evaluators, offering greater flexibility, interpretability, and scalability. Experimental results demonstrate that the proposed approach achieves high agreement with human judgments across multiple tasks, significantly outperforming current methods while substantially reducing computational overhead, thereby enabling efficient and robust LLM evaluation.
📝 Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. With task-specific prompts for settings ranging from best-candidate selection, summarization, and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
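The core idea described above can be sketched in a few lines: each evaluator is prompted independently with the same human-aligned rubric, and the resulting judgments are aggregated with no inter-evaluator communication (a mean for continuous scores, a majority vote for discrete labels). This is a minimal illustrative sketch, not the paper's implementation; the rubric text, function names, and the stand-in evaluator callables (which would be separate LLM API calls in practice) are all assumptions.

```python
import statistics
from collections import Counter

# Human-aligned scoring criteria shared by all evaluators (hypothetical wording).
RUBRIC = "Rate the candidate text for faithfulness, coherence, and fluency."

def evaluate(candidate, evaluators, mode="continuous"):
    """Collect one independent judgment per evaluator, then aggregate.

    Evaluators share no state and never see each other's outputs,
    mirroring the no-coordination design described in the summary.
    """
    scores = [ev(RUBRIC, candidate) for ev in evaluators]  # independent prompts
    if mode == "continuous":
        return statistics.mean(scores)           # joint continuous score
    return Counter(scores).most_common(1)[0][0]  # majority vote for discrete labels

# Stand-in evaluators; a real system would issue one LLM call per evaluator.
evaluators = [lambda r, c: 4.0, lambda r, c: 4.5, lambda r, c: 3.5]
print(evaluate("A candidate summary.", evaluators))  # → 4.0 (mean of 4.0, 4.5, 3.5)
```

Because the evaluators are fully independent, they can run in parallel and be added or removed freely, which is one source of the scalability and reduced overhead the abstract claims.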