SedarEval: Automated Evaluation using Self-Adaptive Rubrics

📅 2025-01-26
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Problem: Existing automatic scoring methods for LLMs rely on generic scoring rules and neglect the domain-specific characteristics of questions and answers, resulting in suboptimal accuracy and stability. Method: We propose a new paradigm for evaluating LLM outputs by generating problem-specific, structured, and interpretable self-adaptive rubrics, tailored to four challenging task categories: long-tail knowledge, mathematics, programming, and logical reasoning. Our core innovation integrates question-characteristic modeling with solution-process awareness in rubric generation. Contribution/Results: We introduce SedarEval, the first open-source fine-grained evaluation benchmark of this kind, comprising 1,000 questions and their corresponding rubrics. Leveraging structured rubric injection, prompt engineering, and supervised fine-tuning, we train a dedicated evaluator language model. Experiments demonstrate that our model significantly outperforms general-purpose judges (e.g., GPT-4) across multiple domains, achieving higher agreement with human graders, superior scoring accuracy, and enhanced robustness.
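The summary above describes rubrics built from structured scoring and deduction points per question. The paper does not publish a rubric schema here, so the sketch below is a hypothetical illustration of how such a rubric might be represented and applied; the names `Criterion`, `SelfAdaptiveRubric`, and `grade` are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a self-adaptive rubric: per-question scoring points
# (positive credit) and deduction points (negative credit), mirroring the
# primary/secondary criteria described in the summary. All names are assumed.
from dataclasses import dataclass, field

@dataclass
class Criterion:
    description: str   # what the grader checks in the answer
    points: float      # positive for scoring points, negative for deductions
    primary: bool = True

@dataclass
class SelfAdaptiveRubric:
    question: str
    criteria: list[Criterion] = field(default_factory=list)
    max_score: float = 10.0

    def grade(self, satisfied: list[bool]) -> float:
        """Sum the points of all satisfied criteria, clamped to [0, max_score]."""
        total = sum(c.points for c, hit in zip(self.criteria, satisfied) if hit)
        return max(0.0, min(self.max_score, total))

# Example: a math question with two scoring points and one deduction point.
rubric = SelfAdaptiveRubric(
    question="Solve x^2 - 5x + 6 = 0.",
    criteria=[
        Criterion("Factors the quadratic correctly", 5.0),
        Criterion("States both roots x=2 and x=3", 5.0),
        Criterion("Arithmetic slip in an intermediate step", -2.0, primary=False),
    ],
)
score = rubric.grade([True, True, True])  # both points earned, one deduction
print(score)  # 8.0
```

Keeping deductions as negative-point criteria lets one grading pass mimic a human evaluator awarding and subtracting marks in a single sweep.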

๐Ÿ“ Abstract
The evaluation paradigm of LLM-as-judge has gained popularity due to its significant reduction in human labor and time costs. This approach utilizes one or more large language models (LLMs) to assess the quality of outputs from other LLMs. However, existing methods rely on generic scoring rubrics that fail to consider the specificities of each question and its problem-solving process, compromising precision and stability in assessments. Inspired by human examination scoring processes, we propose a new evaluation paradigm based on self-adaptive rubrics. Specifically, we create detailed scoring rubrics for each question, capturing the primary and secondary criteria in a structured format of scoring and deduction points that mimics a human evaluator's analytical process. Building on this paradigm, we further develop a novel benchmark called SedarEval, which covers a range of domains including long-tail knowledge, mathematics, coding, and logical reasoning. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. To further streamline the evaluation, we train a specialized evaluator language model (evaluator LM) to supplant human graders. Using the same training data, our evaluator LM achieves a higher concordance rate with human grading results than other paradigms, including GPT-4, highlighting the superiority and efficiency of our approach. We release our dataset at https://github.com/wwn1233/sedareval.
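The abstract reports a higher concordance rate with human grading than other paradigms. As a minimal sketch, assuming exact-match agreement on scores (the paper's precise metric is not given on this page), the rate is the fraction of items where the evaluator's score matches the human score:

```python
# Minimal sketch: concordance rate as exact-match agreement between an
# evaluator LM's scores and human scores. The paper's actual metric may
# differ; this is an illustrative assumption.
def concordance_rate(model_scores: list[float], human_scores: list[float]) -> float:
    assert len(model_scores) == len(human_scores) and model_scores
    matches = sum(m == h for m, h in zip(model_scores, human_scores))
    return matches / len(model_scores)

model = [8.0, 5.0, 10.0, 3.0]
human = [8.0, 6.0, 10.0, 3.0]
print(concordance_rate(model, human))  # 0.75
```

For continuous scores, a tolerance band or a rank correlation would be a natural alternative to exact matching.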
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Scoring Accuracy
Adaptability to Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

SedarEval
Adaptive Scoring Rules
Automated Grading Model