🤖 AI Summary
Existing LLM-based NLG evaluation methods rely on proprietary models and lack fine-grained interpretability. This paper introduces OpeNLGauge, a fully open-source, reference-free, multi-dimensional NLG evaluation framework that localizes errors at the span level and generates natural-language explanations. Methodologically, it is offered either as a two-stage ensemble of larger open-weight LLMs or as a small, lightweight fine-tuned evaluation model, keeping the evaluation pipeline efficient and reproducible. In meta-evaluation, the framework achieves competitive correlation with human judgments on mainstream benchmarks, surpasses state-of-the-art models on certain tasks, and provides explanations that are more than twice as accurate, while generalizing to unseen tasks, domains, and evaluation aspects. Key contributions include: (1) an open, reproducible evaluation architecture; (2) an error-span-aware mechanism for fine-grained assessment; and (3) a metric that combines strong correlation with human judgments with faithful, accurate explanations.
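To make the error-span-level output concrete, the minimal sketch below shows one plausible shape for a single aspect-level evaluation record: a score, the flagged spans, and a natural-language explanation per span. The field names, severity labels, and 1-5 scale are illustrative assumptions, not the paper's actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSpan:
    start: int        # character offset where the error begins in the evaluated text
    end: int          # character offset just past the error
    severity: str     # assumed labels, e.g. "minor" or "major"
    explanation: str  # natural-language justification for flagging this span

@dataclass
class AspectEvaluation:
    aspect: str                                             # e.g. "fluency", "coherence"
    score: int                                              # assumed 1-5 rating for this aspect
    error_spans: list[ErrorSpan] = field(default_factory=list)

# Illustrative record for one system output (all values are made up).
example = AspectEvaluation(
    aspect="faithfulness",
    score=3,
    error_spans=[
        ErrorSpan(start=42, end=71, severity="major",
                  explanation="States a date not supported by the source document."),
    ],
)
```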
📝 Abstract
Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
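Meta-evaluation here means correlating the metric's scores with human ratings over a set of evaluated outputs. As a minimal sketch, assuming per-output human and metric scores are available as parallel lists for one aspect, segment-level correlation can be computed as below; the variable names, example values, and the choice of Kendall's tau and Spearman's rho are illustrative, not the paper's exact protocol.

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical parallel ratings for the same outputs (values are made up):
# one score per evaluated output, for a single aspect such as coherence.
human_scores  = [4, 2, 5, 3, 1, 4, 2]
metric_scores = [3.8, 2.5, 4.9, 2.7, 1.2, 4.1, 2.0]

tau, tau_p = kendalltau(human_scores, metric_scores)
rho, rho_p = spearmanr(human_scores, metric_scores)

print(f"Kendall tau:  {tau:.3f} (p={tau_p:.3g})")
print(f"Spearman rho: {rho:.3f} (p={rho_p:.3g})")
```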