LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

📅 2026-04-20

🤖 AI Summary

This study addresses the limitations of existing machine translation evaluation methods in detecting fine-grained, dialect- and culture-related errors, such as language-variety mismatches, content omissions, and pragmatic infelicities, in diglossic languages like Arabic. To this end, the authors propose LQM, a hierarchical error taxonomy with six linguistically grounded levels (sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics) that is designed to be adaptable to other languages. Using a newly constructed bidirectional parallel corpus of 3,850 sentences spanning seven Arabic dialects, the work evaluates six LLMs in a zero-shot setting and diagnoses their errors through expert span-level annotation, yielding 6,113 labeled error spans across 3,495 unique erroneous sentences, complemented by the spBLEU metric. The authors also release the annotated error data, annotation guidelines, and prompts to support future research.

📝 Abstract

Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics (Figure 1). We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. The LQM-annotated error data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.
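
To make the idea of span-level annotation with severity-weighted scoring concrete, the following is a minimal Python sketch, not the authors' implementation. The six level names come from the abstract; everything else is an assumption made for illustration: the `ErrorSpan` record, the severity labels and their weights (borrowed loosely from common MQM-style practice), and the normalization by token count. The abstract does not specify how LQM's severity-weighted scores are actually computed.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The six LQM levels named in the abstract."""
    SOCIOLINGUISTICS = "sociolinguistics"
    PRAGMATICS = "pragmatics"
    SEMANTICS = "semantics"
    MORPHOSYNTAX = "morphosyntax"
    ORTHOGRAPHY = "orthography"
    GRAPHETICS = "graphetics"


# ASSUMPTION: severity labels and weights are illustrative, not from the paper.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical record for one annotated error span."""
    start: int      # start offset of the span in the translation
    end: int        # end offset (exclusive)
    level: Level    # LQM level assigned by the annotator
    severity: str   # key into SEVERITY_WEIGHTS


def severity_penalty(spans):
    """Sum of the severity weights over all spans of one sentence."""
    return sum(SEVERITY_WEIGHTS[s.severity] for s in spans)


def quality_score(spans, n_tokens):
    """ASSUMED formula: 100 * (1 - penalty / n_tokens), floored at 0."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return max(0.0, 100.0 * (1.0 - severity_penalty(spans) / n_tokens))


def penalty_by_level(spans):
    """Break the total penalty down by LQM level for error diagnosis."""
    totals = {level: 0.0 for level in Level}
    for s in spans:
        totals[s.level] += SEVERITY_WEIGHTS[s.severity]
    return totals


if __name__ == "__main__":
    spans = [
        ErrorSpan(0, 2, Level.SOCIOLINGUISTICS, "major"),
        ErrorSpan(5, 6, Level.ORTHOGRAPHY, "minor"),
    ]
    print(quality_score(spans, n_tokens=20))
    print(penalty_by_level(spans)[Level.SOCIOLINGUISTICS])
```

Keeping the level on each span, rather than only a sentence-level score, is what allows the per-level breakdown: two translations with the same overall score can differ in whether their errors are mostly dialect mismatches or mostly spelling issues.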

Problem

Research questions and friction points this paper is trying to address.

machine translation evaluation

diglossic languages

dialect-specific errors

cultural appropriateness

linguistic quality metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

LQM

multidimensional quality metrics

diglossic languages

linguistically grounded error taxonomy

machine translation evaluation
