Pearmut: Human Evaluation of Translation Made Trivial

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although human evaluation remains the gold standard for multilingual NLP, its adoption is hindered by operational complexity. This work proposes Pearmut, a lightweight platform that streamlines end-to-end human evaluation by integrating established protocols such as DA, ESA, and MQM, while also supporting rapid prototyping of novel evaluation schemes. Pearmut accommodates document-level context, enables both absolute and pairwise comparison assessments, and incorporates attention checks to ensure annotation quality. By leveraging ESAAI-based pre-annotation and flexible task assignment strategies, both static and active-learning-based, it substantially lowers the barrier to entry. The platform thus renders high-quality human evaluation as accessible and routine as automatic metrics, facilitating its integration into everyday model development and diagnostic workflows.

📝 Abstract
Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools, incurring substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active-learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
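The contrast between static and active-learning-based assignment mentioned in the abstract can be sketched as a toy example. This is an illustration only, not Pearmut's actual API: the function name, data shapes, and the metric-disagreement heuristic used as an uncertainty proxy are all assumptions.

```python
import random

def assign_segments(segments, budget, strategy="static"):
    """Pick which segments to send for human annotation.

    segments: list of dicts with "id" and "metric_scores"
              (scores from several automatic metrics, 0-1 scale).
    budget:   number of segments the annotator will judge.
    strategy: "static" takes a fixed random sample;
              "active" prioritizes segments where automatic
              metrics disagree most (a simple uncertainty proxy).
    """
    if strategy == "static":
        return random.sample(segments, budget)
    # "active": rank by score spread (max - min) across metrics,
    # so human effort goes where automatic signals are least reliable
    def disagreement(seg):
        scores = seg["metric_scores"]
        return max(scores) - min(scores)
    ranked = sorted(segments, key=disagreement, reverse=True)
    return ranked[:budget]

segments = [
    {"id": "doc1#3", "metric_scores": [0.92, 0.41, 0.77]},
    {"id": "doc1#4", "metric_scores": [0.88, 0.86, 0.90]},
    {"id": "doc2#1", "metric_scores": [0.55, 0.60, 0.52]},
]
picked = assign_segments(segments, budget=2, strategy="active")
print([s["id"] for s in picked])  # → ['doc1#3', 'doc2#1']
```

A real active-learning scheme would update its selection criterion as annotations arrive; the fixed disagreement ranking here only illustrates why adaptive assignment can spend a limited annotation budget more informatively than a static sample.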
Problem

Research questions and friction points this paper is trying to address.

human evaluation
machine translation
multilingual NLP
evaluation protocols
translation assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

human evaluation
machine translation
evaluation platform
active learning
multilingual NLP