🤖 AI Summary
This work addresses the challenges of constructing dictionaries for low-resource German dialects: annotators are scarce, spelling is highly variable, and large language models (LLMs) perform poorly. The authors propose a lightweight random forest model trained on string similarity features for data-driven bilingual lexicon induction (BLI). On this task, the random forest outperforms the much larger Mistral-123b while requiring fewer resources. Used for query expansion with BM25, the induced dictionaries yield relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100 on dialect information retrieval. The work also examines how well the models transfer across German dialects and how they perform with varying amounts of training data.
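To make "string similarity features" concrete, the following is a minimal sketch of how a dialect/standard word pair might be turned into a feature vector. The specific features (normalized edit similarity, `difflib` sequence ratio, character-bigram Jaccard overlap, length ratio), the function names, and the example word pair are illustrative assumptions, not the feature set reported by the authors.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_ngrams(word: str, n: int = 2) -> set:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity_features(dialect: str, standard: str) -> dict:
    """Hypothetical feature set for one candidate (dialect, standard) pair."""
    d, s = dialect.lower(), standard.lower()
    longest = max(len(d), len(s)) or 1  # avoid division by zero on empty input
    grams_d, grams_s = char_ngrams(d), char_ngrams(s)
    union = grams_d | grams_s
    return {
        "edit_sim": 1.0 - levenshtein(d, s) / longest,
        "seq_ratio": SequenceMatcher(None, d, s).ratio(),
        "bigram_jaccard": len(grams_d & grams_s) / len(union) if union else 0.0,
        "length_ratio": min(len(d), len(s)) / longest,
    }


# Illustrative pair only: a dialect spelling and a standard German candidate.
features = similarity_features("hoam", "heim")
```

Vectors like `features`, computed for matching and non-matching word pairs, could then be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`) that predicts whether a pair is a valid dictionary entry. That training step is omitted here.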
📝 Abstract
Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, a high degree of spelling variation, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
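The extrinsic setup, dictionary-based query expansion in front of BM25, can be sketched as follows. This is a simplified illustration under stated assumptions: the query direction (standard German query, dialect documents), the tiny dictionary and corpus, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all chosen for the example and are not taken from the paper.

```python
import math
from collections import Counter


def expand_query(tokens, dictionary):
    """Append dictionary variants of each query token, keeping the originals."""
    expanded = list(tokens)
    for tok in tokens:
        for variant in dictionary.get(tok, []):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical induced dictionary and toy corpus, for illustration only.
induced_dictionary = {"heim": ["hoam"]}
corpus = [["i", "geh", "hoam"], ["das", "wetter", "ist", "schön"]]

baseline = bm25_scores(["heim"], corpus)
expanded = bm25_scores(expand_query(["heim"], induced_dictionary), corpus)
```

In this toy case the unexpanded query shares no term with either document, so BM25 cannot rank them, while the expanded query matches the dialect spelling in the first document. Lexical mismatch of this kind is what dictionary-based expansion is meant to reduce.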