🤖 AI Summary
Existing LLM-based NLG evaluation methods rely on proprietary models and lack fine-grained interpretability. This paper introduces OpeNLGauge, a fully open-source, reference-free, multi-dimensional NLG evaluation framework that localizes errors at the span level and generates natural-language explanations. Methodologically, it is offered either as a two-stage ensemble of larger open-weight LLMs or as a small, lightweight fine-tuned evaluation model, keeping the evaluation pipeline efficient and reproducible. In meta-evaluation, the framework achieves competitive correlation with human judgments on mainstream benchmarks, surpasses state-of-the-art models on certain tasks, and provides explanations that are more than twice as accurate, while generalizing to unseen tasks, domains, and evaluation aspects. Key contributions include: (1) an open, reproducible evaluation architecture; (2) an error-span-aware mechanism for fine-grained assessment; and (3) a metric that combines strong correlation with human judgments with faithful, accurate explanations.
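To make the error-span-level output concrete, the minimal sketch below shows one plausible shape for a single aspect-level evaluation record: a score, the flagged spans, and a natural-language explanation per span. The field names, severity labels, and 1-5 scale are illustrative assumptions, not the paper's actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSpan:
    start: int        # character offset where the error begins in the evaluated text
    end: int          # character offset just past the error
    severity: str     # assumed labels, e.g. "minor" or "major"
    explanation: str  # natural-language justification for flagging this span

@dataclass
class AspectEvaluation:
    aspect: str                                             # e.g. "fluency", "coherence"
    score: int                                              # assumed 1-5 rating for this aspect
    error_spans: list[ErrorSpan] = field(default_factory=list)

# Illustrative record for one system output (all values are made up).
example = AspectEvaluation(
    aspect="faithfulness",
    score=3,
    error_spans=[
        ErrorSpan(start=42, end=71, severity="major",
                  explanation="States a date not supported by the source document."),
    ],
)
```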
📝 Abstract
Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
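Meta-evaluation here means correlating the metric's scores with human ratings over a set of evaluated outputs. As a minimal sketch, assuming per-output human and metric scores are available as parallel lists for one aspect, segment-level correlation can be computed as below; the variable names, example values, and the choice of Kendall's tau and Spearman's rho are illustrative, not the paper's exact protocol.

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical parallel ratings for the same outputs (values are made up):
# one score per evaluated output, for a single aspect such as coherence.
human_scores  = [4, 2, 5, 3, 1, 4, 2]
metric_scores = [3.8, 2.5, 4.9, 2.7, 1.2, 4.1, 2.0]

tau, tau_p = kendalltau(human_scores, metric_scores)
rho, rho_p = spearmanr(human_scores, metric_scores)

print(f"Kendall tau:  {tau:.3f} (p={tau_p:.3g})")
print(f"Spearman rho: {rho:.3f} (p={rho_p:.3g})")
```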