AI Summary
Conventional MQM-based analytical translation quality evaluation (TQE) employs a linear error penalty model, which, when calibrated on fixed-length reference samples (1000–2000 words), over-penalizes short texts and under-penalizes errors in long texts, contradicting expert intuition. Method: We propose a nonlinear scoring model grounded in the Weber-Fechner law and cognitive load theory, formalizing a two-parameter logarithmic error tolerance function $E(x) = a \ln(1 + bx)$, and integrate it into a Multi-Range framework to ensure length-aware, consistent calibration across text lengths. Contribution/Results: Evaluated across three enterprise scenarios, the model achieves expert alignment within ±20% absolute deviation, significantly improving human-machine evaluation consistency and interpretability. This work is the first to systematically incorporate psychophysical principles into TQE modeling, effectively resolving the inherent length bias of linear models.
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000–2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition.
Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size.
Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model
$E(x) = a \ln(1 + bx)$, with $a, b > 0$,
anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within ±20 percent relative error, and it integrates into existing evaluation workflows by adding only a dynamic tolerance function.
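The two-point calibration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tolerance points used in the usage note (e.g. 5 acceptable errors at 1,500 words, 12 at 10,000 words) are invented for demonstration, and the bisection bracket `[b_lo, b_hi]` is an assumed search range. The idea is that one tolerance point eliminates `a`, leaving a single nonlinear equation in `b` that a one-dimensional root finder can solve.

```python
import math

def calibrate(x1, e1, x2, e2, b_lo=1e-9, b_hi=10.0, tol=1e-12):
    """Fit E(x) = a * ln(1 + b*x) through two tolerance points
    (x1, e1) and (x2, e2), with x in words and e in acceptable errors.

    From the first point, a = e1 / ln(1 + b*x1). Substituting into the
    second point leaves one equation in b, solved here by bisection
    (assumes the residual changes sign on [b_lo, b_hi], which holds
    when tolerance grows sublinearly, i.e. e2 < e1 * x2 / x1)."""
    def resid(b):
        return e1 * math.log(1 + b * x2) / math.log(1 + b * x1) - e2

    lo, hi = b_lo, b_hi
    if resid(lo) * resid(hi) > 0:
        raise ValueError("bracket [b_lo, b_hi] does not straddle a root")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if resid(lo) * resid(mid) <= 0:
            hi = mid  # root lies in the lower half
        else:
            lo = mid  # root lies in the upper half
    b = 0.5 * (lo + hi)
    a = e1 / math.log(1 + b * x1)
    return a, b

def tolerance(x, a, b):
    """Acceptable error count E(x) for a sample of x words."""
    return a * math.log(1 + b * x)
```

For example, `calibrate(1500, 5.0, 10000, 12.0)` returns `(a, b)` such that `tolerance(1500, a, b)` is 5 and `tolerance(10000, a, b)` is 12; any other sample length is then scored against its own length-aware tolerance rather than a linear extrapolation.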
The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.