🤖 AI Summary
Historical newspaper digitization faces challenges including OCR noise, multilingual text mixing, and diachronic language drift, all of which undermine the accuracy and reliability of multilingual question answering (QA). This paper introduces the first end-to-end robust multilingual QA system tailored to noisy historical newspapers, supporting both precise answer generation and principled abstention. The method integrates semantic query expansion with Reciprocal Rank Fusion (RRF) to improve cross-lingual and diachronic retrieval stability; employs evidence-aware, strongly constrained generation prompting that explicitly models answerability and suppresses hallucination; and adopts a modular RAG architecture enabling decomposable evaluation and full reproducibility. Experiments demonstrate substantial gains in recall robustness: the system produces high-fidelity answers for answerable questions and abstains accurately on unanswerable ones. Code and configuration files are publicly released.
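The evidence-aware, abstention-capable prompting described above can be sketched roughly as follows. The wording, template, and function names here are entirely illustrative assumptions, not the paper's actual prompt (which lives in the released configurations):

```python
# Hypothetical sketch of an evidence-grounded, abstention-aware prompt;
# the wording is illustrative, not the paper's actual prompt.
PROMPT_TEMPLATE = """You are answering questions about historical newspapers.
Use ONLY the evidence passages below. The text may contain OCR errors.

Evidence:
{evidence}

Question: {question}

If the evidence does not contain the answer, reply exactly with
"UNANSWERABLE" instead of guessing. Otherwise, answer concisely and
quote the supporting passage."""

def build_prompt(question, passages):
    # Number the retrieved passages so the model can cite them.
    evidence = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return PROMPT_TEMPLATE.format(evidence=evidence, question=question)
```

The key design point is that answerability is modeled explicitly: the model is given a single sanctioned escape hatch ("UNANSWERABLE") rather than being pushed to always produce an answer.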
📝 Abstract
Large-scale digitization initiatives have unlocked massive collections of historical newspapers, yet effective computational access remains hindered by OCR corruption, multilingual orthographic variation, and temporal language drift. We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents. Our approach integrates: (i) semantic query expansion and multi-query fusion using Reciprocal Rank Fusion to improve retrieval robustness against vocabulary mismatch; (ii) a carefully engineered generation prompt that enforces strict grounding in retrieved evidence and explicit abstention when evidence is insufficient; and (iii) a modular architecture enabling systematic component evaluation. We conduct comprehensive ablation studies on Named Entity Recognition and embedding model selection, demonstrating the importance of syntactic coherence in entity extraction and balanced performance-efficiency trade-offs in dense retrieval. Our end-to-end evaluation framework shows that the pipeline generates faithful answers for well-supported queries while correctly abstaining from unanswerable questions. The hybrid retrieval strategy improves recall stability, particularly benefiting from RRF's ability to smooth performance variance across query formulations. We release our code and configurations at https://anonymous.4open.science/r/RAGs-C5AE/, providing a reproducible foundation for robust historical document question answering.
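To make the multi-query fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over the ranked lists returned by several query formulations. The function and document IDs are illustrative, not from the released code; `k=60` is the conventional default from the RRF literature:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; the constant k damps the influence of any
    single list's top positions, smoothing variance across queries.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Three formulations (original query + two semantic expansions)
# retrieve overlapping but differently ordered candidates:
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
fused = reciprocal_rank_fusion(runs)
```

Here `d2` outranks `d1` because it appears near the top of all three lists, which is exactly the consensus effect that stabilizes recall when individual query formulations are noisy.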