Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

📅 2025-01-10
🤖 AI Summary
This work addresses the difficulty of transferring span-annotated QA datasets such as SQuAD 2.0 to another language without corrupting the annotations. The authors rely on the DeepL MT service's ability to translate formatted documents, so that annotated text is translated together with its span markup instead of being taken apart, translated, and reconstructed afterwards. With this method they produce a Finnish version of SQuAD 2.0 and train QA retriever models on it. Quality is assessed through direct evaluation, indirect comparison with other similar datasets, a backtranslation experiment, and the performance of downstream QA models; across these evaluations the method yields consistently better translated data. The dataset and model are released on Hugging Face, and the code on GitHub.

📝 Abstract
We apply a simple method to machine-translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset, and more generally of the MT method, through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, and the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but also produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely that the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.

Problem

Research questions and friction points this paper is trying to address.

Machine translate span-annotated datasets using formatted documents
Produce Finnish SQuAD2.0 QA dataset via DeepL MT service
Evaluate translation quality through multiple validation methods
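
One way to put a number on a backtranslation check is to compare each original English answer with its round-trip (English to Finnish to English) version using token-overlap F1, the measure commonly used for SQuAD-style answers. The sketch below is a simplified illustration, not the paper's evaluation script: the function name `token_f1` is hypothetical, and normalization is limited to lowercasing and whitespace splitting.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between two answer strings (simplified normalization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Two empty strings count as a match; one empty string does not.
    if not ref or not cand:
        return float(ref == cand)
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both strings.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Assumed example pair: an original answer and its round-trip translation.
score = token_f1("the city of Turku", "city of Turku")
print(round(score, 3))
```

Averaging this score over all answers gives a rough signal of how much answer content survives the round trip; it cannot distinguish translation errors from legitimate paraphrases.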

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using DeepL MT service for translation
Translating formatted documents with span annotations
Producing Finnish SQuAD dataset via MT
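
The idea of translating formatted text with span annotations can be sketched as two helpers: one that wraps answer spans in inline tags before translation, and one that strips the tags afterwards and recovers the character offsets in the translated text. This is a minimal sketch under stated assumptions, not the authors' code: the tag name `a`, the helper names `mark_spans` and `extract_spans`, and the hard-coded "translated" string are all hypothetical, spans are assumed not to overlap, the text is assumed to contain no markup characters of its own, and the MT step is replaced by a stand-in string.

```python
import re

# Hypothetical tag name; any element the MT system carries through would do.
TAG = "a"
_TAG_RE = re.compile(rf'<{TAG} id="(\d+)">(.*?)</{TAG}>', re.DOTALL)


def mark_spans(context: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span in <a id="i">...</a>."""
    out = context
    # Insert from the rightmost span so earlier offsets stay valid.
    ordered = sorted(enumerate(spans), key=lambda p: p[1][0], reverse=True)
    for i, (start, end) in ordered:
        out = (out[:start] + f'<{TAG} id="{i}">' + out[start:end]
               + f"</{TAG}>" + out[end:])
    return out


def extract_spans(marked: str) -> tuple[str, dict[int, tuple[int, int]]]:
    """Strip the tags; return plain text and span offsets within it."""
    parts: list[str] = []
    spans: dict[int, tuple[int, int]] = {}
    pos = 0      # read position in `marked`
    length = 0   # length of the plain text emitted so far
    for m in _TAG_RE.finditer(marked):
        before = marked[pos:m.start()]
        inner = m.group(2)
        parts.extend([before, inner])
        length += len(before)
        spans[int(m.group(1))] = (length, length + len(inner))
        length += len(inner)
        pos = m.end()
    parts.append(marked[pos:])
    return "".join(parts), spans


marked = mark_spans("Turku is a city in Finland.", [(0, 5), (19, 26)])
# Stand-in for the MT step: `marked` would be sent to the MT service here.
translated = '<a id="0">Turku</a> on kaupunki <a id="1">Suomessa</a>.'
text, found = extract_spans(translated)
for i, (start, end) in found.items():
    print(i, text[start:end])
```

Because the span ids travel inside the markup, the translated answers can be matched back to their questions even if the MT system reorders them; a span whose tag is lost in translation is simply absent from the returned dictionary and can be flagged for review.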
Emil Nuutinen
TurkuNLP, Department of Computing, University of Turku, Finland
Iiro Rastas
TurkuNLP, Department of Computing, University of Turku, Finland
Filip Ginter
University of Turku
language technology, natural language processing