🤖 AI Summary
This work addresses the limitations of existing machine translation quality estimation methods, which rely on scalar scores that lack interpretability and underperform on low-resource languages. The authors construct the first fine-grained English–Malayalam quality estimation dataset, enriched with detailed error annotations. They propose ALOPE-RL, a novel framework integrating LoRA-based fine-tuning, 4-bit quantization, and policy-gradient reinforcement learning to enable efficient fine-tuning of small-scale large language models (≤4B parameters). With minimal computational overhead and in few-shot settings, ALOPE-RL significantly outperforms both conventional encoder-based models and larger LLMs, achieving state-of-the-art performance while enabling interpretable, fine-grained translation quality judgments for the first time.
📝 Abstract
Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing machine translation quality estimation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English–Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters using policy rewards derived from DA scores and TQR. Integrating error-aware rewards, ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English–Malayalam QE using compact LLMs (≤4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
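To make the error-aware, policy-based idea concrete, here is a minimal sketch of a REINFORCE-style update whose reward blends a DA-score term with a TQR-derived error-overlap term. The reward weighting (`alpha`), the discretization of DA scores into bins, and the toy softmax policy are all illustrative assumptions, not the authors' implementation.

```python
import math
import random

def reward(pred_da, gold_da, pred_errors, gold_errors, alpha=0.7):
    """Blend closeness to the human DA score with overlap against
    TQR-derived error tags. `alpha` is an assumed weighting.
    DA scores are assumed to lie in [0, 100]."""
    da_term = 1.0 - min(abs(pred_da - gold_da) / 100.0, 1.0)
    if gold_errors:
        overlap = len(set(pred_errors) & set(gold_errors)) / len(set(gold_errors))
    else:
        # No annotated errors: reward predicting none.
        overlap = 1.0 if not pred_errors else 0.0
    return alpha * da_term + (1 - alpha) * overlap

def reinforce_step(logits, r, action, lr=0.1):
    """One policy-gradient ascent step on a softmax policy over
    discretized DA-score bins: grad log pi(a) = one_hot(a) - probs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return [logits[i] + lr * r * ((1.0 if i == action else 0.0) - probs[i])
            for i in range(len(logits))]

# Toy episode: sample a DA bin from the policy, score it, update.
random.seed(0)
logits = [0.0, 0.0, 0.0]          # 3 coarse DA bins: low / mid / high
probs_action = random.choices(range(3), weights=[1, 1, 1])[0]
r = reward(90, 85, ["omission"], ["omission", "mistranslation"])
logits = reinforce_step(logits, r, probs_action)
```

With the numbers above, the DA term is 0.95 and the error overlap is 0.5, so the blended reward is 0.7 · 0.95 + 0.3 · 0.5 = 0.815; the update then pushes probability mass toward the sampled bin in proportion to that reward.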