OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based NLG evaluation methods rely on proprietary models and lack fine-grained interpretability. This paper introduces a fully open-source, reference-free, multi-dimensional NLG evaluation framework that localizes errors at the span level and generates natural language explanations. Methodologically, it combines a two-stage ensemble of open-weight LLMs with a small fine-tuned evaluation model for efficient, reproducible evaluation. In meta-evaluation, the framework achieves competitive correlation with human judgments on mainstream benchmarks, surpassing state-of-the-art models on certain tasks, while producing explanations more than twice as accurate. It also generalizes to unseen tasks, domains, and evaluation aspects. Key contributions: (1) a fully open, reproducible design; (2) an error-span-aware mechanism for fine-grained assessment; and (3) the combination of high human correlation with faithful explanations.

📝 Abstract
Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
Problem

Research questions and friction points this paper is trying to address.

Develops an open-source NLG evaluation metric
Addresses reliance on proprietary models for evaluations
Provides fine-grained explanatory feedback for NLG systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source NLG evaluation metric
Two-stage ensemble of open-weight LLMs
Fine-tuned model with error span explanations
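The error-span idea above can be illustrated with a small sketch: each ensemble member flags error spans in the evaluated text, spans confirmed by multiple members are kept, and a score is derived from their severities. All names, the voting threshold, and the severity penalties are hypothetical illustrations, not the paper's actual aggregation procedure.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass(frozen=True)
class ErrorSpan:
    # Hypothetical structure: character offsets into the evaluated output,
    # a severity label, and a short natural-language explanation.
    start: int
    end: int
    severity: str  # "minor" or "major" (assumed labels)
    explanation: str

def aggregate_votes(annotations: list[list[ErrorSpan]],
                    min_votes: int = 2) -> list[ErrorSpan]:
    """Keep spans flagged by at least `min_votes` ensemble members
    (a simple majority-style filter; the paper's aggregation may differ)."""
    counts = Counter((s.start, s.end, s.severity)
                     for member in annotations for s in member)
    kept_keys = {k for k, c in counts.items() if c >= min_votes}
    seen, kept = set(), []
    for member in annotations:
        for s in member:
            key = (s.start, s.end, s.severity)
            if key in kept_keys and key not in seen:
                seen.add(key)
                kept.append(s)  # keep one representative per agreed span
    return kept

def score_from_spans(spans: list[ErrorSpan], max_score: float = 5.0) -> float:
    """Map agreed error spans to a 1..max_score rating by subtracting
    per-span penalties (illustrative values)."""
    penalties = {"minor": 0.5, "major": 1.0}
    total = sum(penalties[s.severity] for s in spans)
    return max(1.0, max_score - total)
```

For example, if two of three members flag the same major factual error and two flag a minor typo, both spans survive the vote and the output is rated 5.0 − 1.0 − 0.5 = 3.5, with each kept span carrying its explanation.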
Ivan Kartáč
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czech Republic
Mateusz Lango
Charles University / Poznan University of Technology
natural language processing · machine learning · explainable AI
Ondřej Dušek
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czech Republic