AI Summary
Cross-dialect information retrieval (CDIR) for low-resource, highly divergent German dialects remains an unsolved challenge due to extreme lexical, morphological, and syntactic variation and the absence of annotated benchmarks. Method: This work formally defines the CDIR task and introduces WikiDIR, the first publicly available benchmark dataset covering seven German dialects. We propose a lightweight document-level translation paradigm to align dialectal variants, circumventing the failure of zero-shot multilingual models (e.g., mBERT, XLM-R) under ultra-low-resource conditions. Contribution/Results: Empirical evaluation reveals severe performance degradation of both traditional lexical methods and state-of-the-art multilingual encoders on dialectal text; in contrast, integrating document-level machine translation preprocessing yields substantial gains in retrieval accuracy. Our approach offers a scalable, cost-effective solution for accessing information in low-resource dialects, establishing the foundational framework and empirical benchmark for CDIR research.
Abstract
A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research on cross-lingual information retrieval (CLIR), cross-dialect information retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources for training retrieval models and the high variability of non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with the high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. Finally, we demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.