Dialect Normalization using Large Language Models and Morphological Rules

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Low-resource dialects lack parallel corpora, so standard natural language understanding (NLU) tools perform poorly on them. To address this, we propose a dialect-to-standard normalization method that requires no parallel data and that, for the first time, combines explicit, linguistically motivated morphological rules with few-shot prompting of a large language model (LLM). Applying the morphological rules constrains the LLM’s output space and thereby mitigates orthographic interference. We evaluate the method on a dataset of Greek dialectal proverbs: human evaluation confirms high normalization quality, and downstream semantic analysis shows that the original meaning is preserved. The work offers a scalable, interpretable approach to low-resource dialect understanding and a template for combining structured linguistic knowledge with LLM inference in unsupervised normalization tasks.

📝 Abstract

Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based, linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.

Problem

Research questions and friction points this paper is trying to address.

Normalizing dialect text for standard-language tools

Combining rule-based and LLM methods without parallel data

Improving understanding of Greek dialects and proverbs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based and LLM transformations

Uses few-shot prompting without parallel data

Applies method to Greek dialects and proverbs
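
The combination of rule-based transformations and few-shot prompting listed above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the rewrite rules are invented toy patterns (not the paper's Greek morphological rules), the prompt wording and the form of the demonstrations are hypothetical, and `llm` stands for any caller-supplied function that maps a prompt string to a completion.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical rewrite rules as (regex pattern, replacement) pairs.
# These are invented toy patterns, NOT the paper's actual rules.
RULES: List[Tuple[str, str]] = [
    (r"tz", "k"),       # toy consonant correspondence
    (r"oun\b", "on"),   # toy word-final suffix mapping
]


def apply_rules(text: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Apply each rewrite rule in order to produce a rule-based draft."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def build_prompt(dialect_text: str, demos: List[Tuple[str, str]]) -> str:
    """Build a few-shot prompt that also shows the LLM the rule-based draft."""
    lines = ["Rewrite the dialectal sentence in the standard language.", ""]
    for source, target in demos:  # hand-written demonstrations (assumed form)
        lines += [f"Dialect: {source}", f"Standard: {target}", ""]
    lines += [
        f"Dialect: {dialect_text}",
        f"Rule-based draft: {apply_rules(dialect_text)}",
        "Standard:",
    ]
    return "\n".join(lines)


def normalize(
    dialect_text: str,
    demos: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> str:
    """Run the rules, prompt the caller-supplied LLM, and return its answer."""
    return llm(build_prompt(dialect_text, demos)).strip()
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider, and it lets the rule stage be inspected or tested on its own before any model is involved.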
