Pearmut: Human Evaluation of Translation Made Trivial

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although human evaluation remains the gold standard for multilingual NLP, its adoption is hindered by operational complexity. This work proposes Pearmut, a lightweight platform that streamlines end-to-end human evaluation by integrating established protocols such as DA, ESA, and MQM, while also supporting rapid prototyping of novel evaluation schemes. Pearmut accommodates document-level context, enables both absolute and pairwise comparison assessments, and incorporates attention checks to ensure annotation quality. By leveraging ESAAI-based pre-annotation and flexible task assignment strategies, both static and active-learning-based, it substantially lowers the barrier to entry. The platform thus renders high-quality human evaluation as accessible and routine as automatic metrics, facilitating its integration into everyday model development and diagnostic workflows.

📝 Abstract
Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools, incurring substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active-learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
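The contrast between static and active-learning-based assignment mentioned in the abstract can be sketched as a toy example. This is an illustration only, not Pearmut's actual API: the function name, data shapes, and the metric-disagreement heuristic used as an uncertainty proxy are all assumptions.

```python
import random

def assign_segments(segments, budget, strategy="static"):
    """Pick which segments to send for human annotation.

    segments: list of dicts with "id" and "metric_scores"
              (scores from several automatic metrics, 0-1 scale).
    budget:   number of segments the annotator will judge.
    strategy: "static" takes a fixed random sample;
              "active" prioritizes segments where automatic
              metrics disagree most (a simple uncertainty proxy).
    """
    if strategy == "static":
        return random.sample(segments, budget)
    # "active": rank by score spread (max - min) across metrics,
    # so human effort goes where automatic signals are least reliable
    def disagreement(seg):
        scores = seg["metric_scores"]
        return max(scores) - min(scores)
    ranked = sorted(segments, key=disagreement, reverse=True)
    return ranked[:budget]

segments = [
    {"id": "doc1#3", "metric_scores": [0.92, 0.41, 0.77]},
    {"id": "doc1#4", "metric_scores": [0.88, 0.86, 0.90]},
    {"id": "doc2#1", "metric_scores": [0.55, 0.60, 0.52]},
]
picked = assign_segments(segments, budget=2, strategy="active")
print([s["id"] for s in picked])  # → ['doc1#3', 'doc2#1']
```

A real active-learning scheme would update its selection criterion as annotations arrive; the fixed disagreement ranking here only illustrates why adaptive assignment can spend a limited annotation budget more informatively than a static sample.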
Problem

Research questions and friction points this paper is trying to address.

human evaluation
machine translation
multilingual NLP
evaluation protocols
translation assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

human evaluation
machine translation
evaluation platform
active learning
multilingual NLP