🤖 AI Summary
This work proposes a reference-free, multi-independent-evaluator framework for large language model (LLM) assessment, addressing existing approaches' reliance on ground-truth labels or inter-evaluator coordination. By using human-aligned scoring criteria to guide multiple independently prompted LLM evaluators, the method supports both discrete and continuous joint scoring, making it adaptable across diverse task settings. Notably, it eliminates the need for reference texts and for coordination among evaluators, offering greater flexibility, interpretability, and scalability. Experimental results demonstrate that the proposed approach achieves high agreement with human judgments across multiple tasks, significantly outperforming current methods while substantially reducing computational overhead, thereby enabling efficient and robust LLM evaluation.
📝 Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. With task-specific prompts for settings ranging from best-candidate selection, summarization, and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
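The core idea described above can be sketched in a few lines: each evaluator is prompted independently with the same human-aligned rubric, and the resulting judgments are aggregated with no inter-evaluator communication (a mean for continuous scores, a majority vote for discrete labels). This is a minimal illustrative sketch, not the paper's implementation; the rubric text, function names, and the stand-in evaluator callables (which would be separate LLM API calls in practice) are all assumptions.

```python
import statistics
from collections import Counter

# Human-aligned scoring criteria shared by all evaluators (hypothetical wording).
RUBRIC = "Rate the candidate text for faithfulness, coherence, and fluency."

def evaluate(candidate, evaluators, mode="continuous"):
    """Collect one independent judgment per evaluator, then aggregate.

    Evaluators share no state and never see each other's outputs,
    mirroring the no-coordination design described in the summary.
    """
    scores = [ev(RUBRIC, candidate) for ev in evaluators]  # independent prompts
    if mode == "continuous":
        return statistics.mean(scores)           # joint continuous score
    return Counter(scores).most_common(1)[0][0]  # majority vote for discrete labels

# Stand-in evaluators; a real system would issue one LLM call per evaluator.
evaluators = [lambda r, c: 4.0, lambda r, c: 4.5, lambda r, c: 3.5]
print(evaluate("A candidate summary.", evaluators))  # → 4.0 (mean of 4.0, 4.5, 3.5)
```

Because the evaluators are fully independent, they can run in parallel and be added or removed freely, which is one source of the scalability and reduced overhead the abstract claims.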