Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

career value

142K/year

🤖 AI Summary

This study systematically investigates large language models’ (LLMs) capability to identify and translate dialectal vocabulary, focusing on Bavarian German. Addressing key challenges—including nonstandard orthography, high morphological variation, and scarcity of annotated resources—we propose DiaLemma, the first framework for constructing dialect lexicons using monolingual corpora only. We further introduce a benchmark dataset comprising 100,000 manually curated German–Bavarian word pairs. Empirical evaluation reveals that LLMs perform relatively well on nouns and semantically similar word pairs but struggle to distinguish inflectional variants from literal translations; while contextual prompting improves translation quality, it concurrently degrades variant identification accuracy. Our findings expose fundamental limitations of LLMs in modeling dialectal morphological variation and provide both a novel methodology and a reproducible benchmark for low-resource dialect NLP.

Technology Category

Application Category

📝 Abstract

Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.

Problem

Research questions and friction points this paper is trying to address.

Investigating LLMs' ability to recognize and translate dialect terms

Creating dialect dictionaries from monolingual data without standard orthography

Evaluating LLM performance on distinguishing dialect translations from variants

Innovation

Methods, ideas, or system contributions that make the work stand out.

DiaLemma framework creates dialect dictionaries from monolingual data

Evaluates LLMs on recognizing dialect translations and inflected variants

Uses contextual examples to improve translation but reduces variant recognition

🔎 Similar Papers

No similar papers found.