Dialect Normalization using Large Language Models and Morphological Rules

📅 2025-06-10
🤖 AI Summary
Standard natural language understanding (NLU) tools perform poorly on low-resource dialects, and parallel dialect-to-standard corpora are rarely available. To address this, the authors propose a dialect-to-standard normalization method that requires no parallel data and combines linguistically informed morphological rules with few-shot prompting of a large language model (LLM). The method is implemented for Greek dialects and applied to a dataset of regional proverbs, with the normalized outputs judged by human annotators. Downstream experiments on the normalized proverbs indicate that earlier findings on this dataset relied on superficial cues, including orthographic artifacts, while new observations can still be drawn from the semantics that remain after normalization. The work offers an example of pairing structured linguistic knowledge with LLM-based inference for normalization without parallel supervision.

📝 Abstract
Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based, linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
Problem

Research questions and friction points this paper is trying to address.

Normalizing dialect text for standard-language tools
Combining rule-based and LLM methods without parallel data
Improving understanding of Greek dialects and proverbs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based and LLM transformations
Uses few-shot prompting without parallel data
Applies method to Greek dialects and proverbs
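
As a rough illustration of how rule-based transformations and few-shot prompting might be chained, the sketch below applies rewrite rules first and then builds a prompt for an LLM. Everything in it is an assumption made for illustration: the rules and demonstration pairs are toy placeholders that describe no real dialect, the order of the two stages is a guess, and `llm` is a hypothetical callable standing in for whatever model client is used. It is not the authors' implementation.

```python
import re
from typing import Callable, List, Tuple

# Toy placeholder rules (regex pattern, replacement). Invented for illustration;
# they are NOT the paper's rules and describe no real dialect.
RULES: List[Tuple[str, str]] = [
    (r"u\b", "o"),  # hypothetical: word-final "u" -> "o"
    (r"tz", "ts"),  # hypothetical: "tz" -> "ts"
]

# Hypothetical hand-written demonstrations: (rule-processed input, standard form).
EXAMPLES: List[Tuple[str, str]] = [
    ("kalo pedi", "kalo paidi"),
]


def apply_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each rewrite rule in order, before the LLM sees the text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def build_prompt(text: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt that ends with the sentence to normalize."""
    blocks = ["Rewrite the dialectal sentence in the standard language."]
    for source, target in examples:
        blocks.append(f"Dialect: {source}\nStandard: {target}")
    blocks.append(f"Dialect: {text}\nStandard:")
    return "\n\n".join(blocks)


def normalize(
    text: str,
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns a completion
    rules: List[Tuple[str, str]] = RULES,
    examples: List[Tuple[str, str]] = EXAMPLES,
) -> str:
    """Run the rule stage, then ask the LLM to finish the normalization."""
    preprocessed = apply_rules(text, rules)
    return llm(build_prompt(preprocessed, examples)).strip()
```

Passing the model in as a plain callable keeps the rule stage testable on its own and makes it easy to swap in any LLM client.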
Antonios Dimakis
Archimedes, Athena Research Center, Greece; Department of Informatics and Telecommunications, NKUA
John Pavlopoulos
Athens University of Economics and Business
Machine Learning, NLP, Data Science
Antonios Anastasopoulos
Archimedes, Athena Research Center, Greece; Department of Computer Science, George Mason University