🤖 AI Summary
This paper addresses cross-lingual semantic retrieval for historical Luxembourgish texts, where pre-trained multilingual embedding models struggle because of OCR noise and outdated spellings. We propose a simple domain adaptation method: GPT-4o is used to segment historical news articles into sentences and translate them into German and French, and the resulting parallel sentence pairs serve as in-domain data for fine-tuning multilingual embedding models. The adapted models reach up to 98% accuracy in historical Luxembourgish-German/French cross-lingual evaluations. Key contributions include: (1) releasing historical Luxembourgish-German/French bitexts, comprising 20,000 parallel training sentences per language pair and a separate historical bitext mining evaluation set; (2) releasing multilingual embedding models adapted to historical Luxembourgish; and (3) empirically validating a practical pathway for cross-lingual retrieval in a low-resource historical language.
📝 Abstract
The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that pre-trained multilingual models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.
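To make the bitext mining evaluation mentioned in the abstract concrete, here is a minimal sketch of how accuracy on such a set could be computed. It assumes top-1 nearest-neighbour matching under cosine similarity, with row i of the source embeddings aligned to row i of the target embeddings. The abstract does not spell out the exact protocol, so the function name `bitext_mining_accuracy` and the toy random embeddings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def bitext_mining_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Top-1 bitext mining accuracy (hypothetical helper).

    Assumes row i of src_emb is the translation of row i of tgt_emb.
    A source sentence counts as correct when its most cosine-similar
    target sentence is the aligned one.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    similarity = src @ tgt.T                 # shape: (n_src, n_tgt)
    predicted = similarity.argmax(axis=1)    # best target index per source
    gold = np.arange(len(src))               # aligned index is the row number
    return float((predicted == gold).mean())


# Toy data standing in for real sentence embeddings (assumption: in practice
# these would come from a multilingual embedding model).
rng = np.random.default_rng(0)
tgt = rng.normal(size=(5, 16))
src = tgt + 0.01 * rng.normal(size=(5, 16))  # near-copies of their translations

accuracy = bitext_mining_accuracy(src, tgt)
reversed_accuracy = bitext_mining_accuracy(tgt, tgt[::-1])
```

In the toy data each source vector is a slightly perturbed copy of its target, so every pair is retrieved correctly. Reversing the target order leaves only the middle sentence aligned with itself, which drops the score to one in five.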