RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

📅 2025-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM evaluation methods rely heavily on powerful LLM judges, which brings high cost, privacy and security risks, and poor reproducibility. This paper introduces RocketEval, an automated evaluation framework that uses a lightweight LLM (e.g., Gemma-2-2B) as the judge together with instance-specific grading checklists. The evaluation task is reframed as a multi-faceted Q&A over an automatically generated checklist, the lightweight judge grades each checklist item, and the items are then reweighted to align with supervised annotations. The process is designed to address the high uncertainty and positional bias that limit lightweight judges. On MT-Bench and WildBench, RocketEval with Gemma-2-2B reaches a correlation of 0.965 with human preferences, comparable to GPT-4o, while reducing cost by more than 50-fold in large-scale evaluation and comparison scenarios.

📝 Abstract

Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method by leveraging a lightweight LLM as the judge, named RocketEval. Initially, we identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributable to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments carried out on the automated evaluation benchmarks, MT-Bench and WildBench datasets, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at https://github.com/Joinn99/RocketEval-ICLR .
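The grading and reweighting steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so everything here is an assumption. It assumes each checklist item is a yes/no question, that the judge exposes logits for the two answer tokens, that several responses to the same query are graded against the same checklist, and that reweighting is a plain least-squares fit to annotated scores. The names `yes_probability`, `checklist_score`, and `fit_weights` are hypothetical.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Soft judgment for one checklist item: P(yes) normalized over {yes, no}.

    Keeping a probability instead of a hard yes/no retains the judge's
    uncertainty (assumed design, not taken from the paper).
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def checklist_score(probs, weights, bias=0.0):
    """Weighted sum of per-item probabilities for one response."""
    return bias + sum(w * p for w, p in zip(weights, probs))


def fit_weights(rows, targets, lr=0.1, steps=5000):
    """Fit item weights so checklist scores match annotated scores.

    rows:    one list of item probabilities per graded response
    targets: one annotated score per response
    Uses batch gradient descent on mean squared error (assumed objective).
    """
    n_items = len(rows[0])
    n = len(rows)
    weights = [0.0] * n_items
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_items
        grad_b = 0.0
        for probs, y in zip(rows, targets):
            err = checklist_score(probs, weights, bias) - y
            grad_b += err
            for j, p in enumerate(probs):
                grad_w[j] += err * p
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


if __name__ == "__main__":
    # Made-up data: four responses graded on a two-item checklist.
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    targets = [2.0, 1.0, 3.0, 0.0]
    weights, bias = fit_weights(rows, targets)
    print(weights, bias)
```

In this toy setup the first item ends up with about twice the weight of the second, which is the kind of alignment with annotations that the reweighting step is described as providing.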

Problem

Research questions and friction points this paper is trying to address.

High cost, privacy and security concerns, and limited reproducibility when a powerful LLM is used as the judge.

Limited judgment accuracy of lightweight LLMs, largely attributable to high uncertainty and positional bias.

Need for efficient, replicable, and accurate automated evaluation methods.

Innovation

Methods, ideas, or system contributions that make the work stand out.

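Uses a lightweight LLM (e.g., Gemma-2-2B) as the judge, cutting evaluation cost by more than 50-fold.

Grades responses against an instance-specific checklist and reweights the checklist items to align with supervised annotations.

Achieves a 0.965 correlation with human preferences on MT-Bench and WildBench, comparable to GPT-4o.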
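To make "correlation with human preferences" concrete, here is a small sketch of a rank correlation between judge scores and human scores. The page does not say which coefficient the paper reports, so Spearman's rank correlation is an assumption here, and the function names and sample numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up per-model scores from a judge and from human raters.
    judge_scores = [7.1, 6.4, 8.0, 5.2]
    human_scores = [6.9, 6.8, 8.3, 4.9]
    print(spearman(judge_scores, human_scores))
```

A value close to 1 means the judge orders the evaluated models almost the same way human raters do.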

🔎 Similar Papers

CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

📅 2024-03-27

📈 Citations: 0