AI Summary
Conventional MQM-based analytical translation quality evaluation (TQE) employs a linear error penalty model, which, when calibrated on fixed-length reference samples (1000–2000 words), over-penalizes short texts and under-penalizes errors in long texts, contradicting expert intuition. Method: We propose a nonlinear scoring model grounded in the Weber-Fechner law and cognitive load theory, formalizing a two-parameter logarithmic error tolerance function $E(x) = a \ln(1 + bx)$, and integrate it into a Multi-Range framework to ensure length-aware, consistent calibration across text lengths. Contribution/Results: Evaluated across three enterprise scenarios, the model achieves expert alignment within ±20% absolute deviation, significantly improving human-machine evaluation consistency and interpretability. This work is the first to systematically incorporate psychophysical principles into TQE modeling, effectively resolving the inherent length bias of linear models.
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000–2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition.
Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size.
Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model
$E(x) = a \ln(1 + bx)$, with $a, b > 0$,
anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within ±20 percent relative error, and it integrates into existing evaluation workflows by adding only a dynamic tolerance function.
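The two-point calibration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tolerance points used in the usage note (e.g. 5 acceptable errors at 1,500 words, 12 at 10,000 words) are invented for demonstration, and the bisection bracket `[b_lo, b_hi]` is an assumed search range. The idea is that one tolerance point eliminates `a`, leaving a single nonlinear equation in `b` that a one-dimensional root finder can solve.

```python
import math

def calibrate(x1, e1, x2, e2, b_lo=1e-9, b_hi=10.0, tol=1e-12):
    """Fit E(x) = a * ln(1 + b*x) through two tolerance points
    (x1, e1) and (x2, e2), with x in words and e in acceptable errors.

    From the first point, a = e1 / ln(1 + b*x1). Substituting into the
    second point leaves one equation in b, solved here by bisection
    (assumes the residual changes sign on [b_lo, b_hi], which holds
    when tolerance grows sublinearly, i.e. e2 < e1 * x2 / x1)."""
    def resid(b):
        return e1 * math.log(1 + b * x2) / math.log(1 + b * x1) - e2

    lo, hi = b_lo, b_hi
    if resid(lo) * resid(hi) > 0:
        raise ValueError("bracket [b_lo, b_hi] does not straddle a root")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if resid(lo) * resid(mid) <= 0:
            hi = mid  # root lies in the lower half
        else:
            lo = mid  # root lies in the upper half
    b = 0.5 * (lo + hi)
    a = e1 / math.log(1 + b * x1)
    return a, b

def tolerance(x, a, b):
    """Acceptable error count E(x) for a sample of x words."""
    return a * math.log(1 + b * x)
```

For example, `calibrate(1500, 5.0, 10000, 12.0)` returns `(a, b)` such that `tolerance(1500, a, b)` is 5 and `tolerance(10000, a, b)` is 12; any other sample length is then scored against its own length-aware tolerance rather than a linear extrapolation.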
The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.