🤖 AI Summary
Existing LLM-based automatic evaluation methods rely on predefined, generic criteria, which limits generalization to unseen instructions and yields unreliable assessment of quantitative and structural constraints. To address these limitations, the authors propose ARJudge, a novel "analysis–refinement" framework with two components: a fine-tuned open-source Analyzer that adaptively formulates multi-faceted evaluation criteria and produces both text-based and code-driven analyses, and a tuning-free Refiner that combines and refines these analyses into a final judgment, so that numerical accuracy, format compliance, and other structural constraints can be verified by executing code rather than judged from text alone. The Analyzer is trained on a Composite Analysis Corpus spanning three task types: evaluation-criteria generation, text-based analysis, and code-driven analysis. Extensive experiments demonstrate that ARJudge significantly outperforms state-of-the-art fine-tuned evaluators across multiple benchmarks, with notable gains in generalization to unseen instructions and in checking numerical and format constraints.
📝 Abstract
Large Language Models (LLMs) are increasingly used for automated evaluation across a wide range of scenarios. Previous studies have fine-tuned open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analysis under predefined, general criteria, which reduces their adaptability to unseen instructions and makes them unreliable when evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. To train the Analyzer, we construct a Composite Analysis Corpus that integrates tasks for evaluation-criteria generation alongside text-based and code-driven analysis generation. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in both effectiveness and robustness, and they highlight the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
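The intuition behind code-driven analysis is that quantitative and structural constraints can be verified by executing a check rather than judged in free text. The sketch below is purely illustrative and not from the paper: the function names, constraints, and return format are our own assumptions about what such executable checks might look like for a word-count constraint and a JSON-format constraint.

```python
# Hypothetical sketch of code-driven constraint checks (names and
# constraints are illustrative, not ARJudge's actual implementation).
import json


def check_word_limit(response: str, max_words: int) -> dict:
    """Verify a quantitative constraint ("answer in at most N words") by counting."""
    n = len(response.split())
    return {"constraint": f"<= {max_words} words", "passed": n <= max_words, "observed": n}


def check_json_format(response: str, required_keys: list) -> dict:
    """Verify a structural constraint ("respond in JSON with keys ...") by parsing."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return {"constraint": f"JSON with keys {required_keys}", "passed": False,
                "missing": list(required_keys)}
    missing = [k for k in required_keys if k not in obj]
    return {"constraint": f"JSON with keys {required_keys}", "passed": not missing,
            "missing": missing}


# Example: evaluate one model response against two instruction constraints.
response = '{"answer": "42", "reason": "six times seven"}'
results = [
    check_word_limit(response, 20),
    check_json_format(response, ["answer", "reason"]),
]
```

In a framework like ARJudge, the outcomes of such executed checks would be one stream of analyses that a refinement stage weighs alongside text-based judgments when producing the final verdict.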