🤖 AI Summary
This work addresses two pervasive challenges in constructing extractive question-answering (QA) datasets for low-resource languages—specifically Amharic: (1) misalignment between translated questions and their corresponding answers, and (2) interference from multiple candidate answers during translation. We propose the first reproducible, fully automated pipeline for dataset construction. Our method integrates a dual alignment verification mechanism—combining cosine similarity of fine-tuned Amharic BERT embeddings with longest common subsequence (LCS) matching—to rigorously enforce question-answer boundary consistency in translations. Additionally, we employ synthetic data augmentation to enhance robustness. Leveraging this pipeline, we introduce AmaSQuAD, the first open-source Amharic QA benchmark, derived from SQuAD 2.0. Experiments show substantial improvements: +7.49–7.86 F1 points on the AmaSQuAD development set, and +1.0 F1 and +0.16 EM points on the human-annotated AmQA dataset—marking significant progress for low-resource language QA research.
📝 Abstract
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages, demonstrated by the creation of the AmaSQuAD dataset, a translation of SQuAD 2.0 into Amharic. The methodology addresses challenges related to misalignment between translated questions and answers, as well as the presence of multiple answer instances in the translated context. To resolve these, we use cosine similarity over embeddings from a BERT-based model fine-tuned for Amharic, together with Longest Common Subsequence (LCS) matching. Additionally, we fine-tune the XLM-R model on the synthetic AmaSQuAD dataset for Amharic question answering. The results show an improvement over the baseline: the fine-tuned model raises the F1 score from 36.55% to 44.41% and from 50.01% to 57.50% on the AmaSQuAD development set. Moreover, the model improves on the human-curated AmQA dataset, increasing the F1 score from 67.80% to 68.80% and the exact match score from 52.50% to 52.66%. The AmaSQuAD dataset is publicly available.
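The dual alignment check described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the stand-in vectors here replace embeddings from the fine-tuned Amharic BERT model, and the threshold values (`cos_thresh`, `lcs_thresh`) are assumptions, not values reported by the authors.

```python
import math

def cosine_similarity(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lcs_length(a, b):
    # classic O(len(a) * len(b)) dynamic program over characters
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def is_aligned(answer_emb, span_emb, answer_text, span_text,
               cos_thresh=0.8, lcs_thresh=0.7):
    # accept a candidate answer span only if BOTH checks pass:
    # (1) embedding similarity between the translated answer and the
    #     candidate context span is high enough, and
    # (2) their character-level LCS overlap is high enough, which also
    #     disambiguates between multiple occurrences of the answer string
    cos_ok = cosine_similarity(answer_emb, span_emb) >= cos_thresh
    lcs_ratio = lcs_length(answer_text, span_text) / max(len(answer_text), 1)
    return cos_ok and lcs_ratio >= lcs_thresh
```

Requiring both signals means a span that merely looks semantically similar (high cosine) but has little surface overlap with the translated answer, or vice versa, is rejected, which is how the pipeline enforces question-answer boundary consistency.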