🤖 AI Summary
Low-resource dialects lack parallel corpora, so natural language understanding (NLU) tools built for the standard language perform poorly on them. This work proposes a dialect-to-standard normalization method that requires no parallel data and combines linguistically informed, rule-based transformations with large language models (LLMs) under targeted few-shot prompting. The method is implemented for Greek dialects and applied to a dataset of regional proverbs, with the outputs evaluated by human annotators. Downstream experiments on the normalized proverbs indicate that earlier findings relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made from the semantics that remain after normalization. The approach offers a template for combining structured linguistic knowledge with LLM-based inference in normalization tasks that lack parallel data.
📝 Abstract
Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization addresses this issue by transforming dialectal text so that standard-language tools can process it downstream. In this study, we introduce a new normalization method that combines rule-based, linguistically informed transformations with large language models (LLMs) and targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs with human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
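The two-stage idea described above (rule-based rewriting first, then few-shot LLM prompting) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the regex-based rule format, the prompt wording, and the `llm` callable are hypothetical and are not taken from the paper, whose actual rules and prompts are not given in this abstract.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical rule format: (regex pattern, replacement string).
# The paper's actual linguistic rules are not specified here.
Rule = Tuple[str, str]
# Hypothetical few-shot example: (dialectal text, standard text).
Example = Tuple[str, str]


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Apply each rewrite rule in order to the dialectal text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def build_prompt(pre_normalized: str, examples: List[Example]) -> str:
    """Assemble a few-shot prompt ending with the text to normalize."""
    blocks = ["Rewrite the dialectal sentence in the standard language."]
    for dialect, standard in examples:
        blocks.append(f"Dialect: {dialect}\nStandard: {standard}")
    blocks.append(f"Dialect: {pre_normalized}\nStandard:")
    return "\n\n".join(blocks)


def normalize(
    text: str,
    rules: List[Rule],
    examples: List[Example],
    llm: Callable[[str], str],
) -> str:
    """Rule-based pre-normalization followed by an LLM call.

    `llm` is any function mapping a prompt string to a completion
    string; which model or API sits behind it is left open.
    """
    prompt = build_prompt(apply_rules(text, rules), examples)
    return llm(prompt).strip()
```

Keeping the model behind a plain callable means the rule stage can be tested on its own with a stub, and no parallel corpus is needed beyond the handful of example pairs placed in the prompt.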