xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

📅 2024-05-20

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Problem: Existing LLM evaluation frameworks are unreliable because of regex-based answer extraction errors, test-set leakage, and prompt-format overfitting. Method: The paper proposes xFinder, an end-to-end learnable evaluator for key answer extraction and matching. It replaces hand-crafted regexes with a lightweight, semantics-aware, trainable extraction module, trained on the newly introduced Key Answer Finder (KAF) dataset, a benchmark designed specifically for training and evaluating generalizable evaluators. Using a 500M-parameter LLM backbone, xFinder incorporates supervised fine-tuning and structured output constraints. Results: The smallest xFinder variant achieves 93.42% average answer extraction accuracy, compared with 74.38% for RegEx in the best existing evaluation framework, and xFinder attains 97.61% final judgment accuracy. It outperforms existing evaluation frameworks and judge models in judgment accuracy, and the paper also reports gains in fairness, generality, and evaluation efficiency.

📝 Abstract

The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. Particularly, the emergence of cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. As evaluation frameworks commonly use Regular Expression (RegEx) for answer extraction, models may adjust their responses to fit formats easily handled by RegEx. Nevertheless, the key answer extraction module based on RegEx frequently suffers from extraction errors. Furthermore, recent studies proposing fine-tuned LLMs as judge models for automated evaluation face challenges in terms of generalization ability and fairness. This paper comprehensively analyzes the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. Our findings suggest that improving the key answer extraction module can lead to higher judgment accuracy and improved evaluation efficiency compared to the judge models. To address these issues, we propose xFinder, a novel evaluator for answer extraction and matching in LLM evaluation. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
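To make the RegEx failure mode concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the pattern, the toy responses, and the helper names `regex_extract` and `evaluate` are hypothetical and are not taken from the paper or from any specific evaluation framework. It separates the two metrics the abstract reports: extraction accuracy (was the right key answer pulled out of the response?) and judgment accuracy (was the response correctly marked right or wrong?).

```python
import re
from typing import Optional

# Hypothetical pattern in the spirit of harness-style extraction rules;
# it is NOT copied from any real evaluation framework.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def regex_extract(response: str) -> Optional[str]:
    """Return the option letter matched by the pattern, or None if no match."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


# Invented toy data: (model response, key answer a human would extract, gold label).
SAMPLES = [
    ("The answer is (B).", "B", "B"),                  # format the pattern expects
    ("I think the answer is c", "C", "B"),             # lowercase, still matched
    ("B. Paris is the capital of France.", "B", "B"),  # no "answer is": missed
    ("The answer is definitely B.", "B", "B"),         # "d" of "definitely" is captured
    ("Not A; I would go with C.", "C", "B"),           # free-form phrasing: missed
]


def evaluate(samples):
    """Return (extraction accuracy, judgment accuracy) for the regex extractor."""
    extraction_hits = 0
    judgment_hits = 0
    for response, key_answer, gold in samples:
        extracted = regex_extract(response)
        # Extraction is right only if it equals the human-extracted key answer.
        extraction_hits += int(extracted == key_answer)
        # Judgment is right if the automatic verdict equals the true verdict.
        judged_correct = extracted == gold
        truly_correct = key_answer == gold
        judgment_hits += int(judged_correct == truly_correct)
    n = len(samples)
    return extraction_hits / n, judgment_hits / n


if __name__ == "__main__":
    extraction_acc, judgment_acc = evaluate(SAMPLES)
    print(f"extraction accuracy: {extraction_acc:.2f}")
    print(f"judgment accuracy:   {judgment_acc:.2f}")
```

On this toy data the judgment accuracy comes out higher than the extraction accuracy, because a failed extraction on an already wrong response still yields the right verdict. That is only one plausible reading of why a judgment figure can exceed an extraction figure; the sketch does not reproduce the paper's evaluation protocol.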

Problem

Research questions and friction points this paper is trying to address.

RegEx-based key answer extraction frequently produces extraction errors

Test set leakage and prompt format overfitting undermine reliable LLM evaluation

Fine-tuned judge models struggle with generalization and fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes key answer extraction module

Introduces xFinder for answer matching

Uses KAF dataset for model training
