ACADATA: Parallel Dataset of Academic Data for Machine Translation

📅 2025-10-14
🤖 AI Summary
To address the scarcity of high-quality multilingual parallel corpora, particularly long-context academic translation data, this paper introduces ACADATA, a large-scale academic translation corpus. Its training subset, ACAD-TRAIN, contains approximately 1.5 million author-generated paragraph pairs across 96 language directions; its evaluation subset, ACAD-BENCH, is a curated benchmark of almost 6,000 translations covering 12 directions. Fine-tuning lightweight LLMs on ACAD-TRAIN yields average d-BLEU improvements of +6.1 (7B models) and +12.4 (2B models) over baselines, and improves long-context translation in a general domain by up to 24.9% when translating out of English. The best fine-tuned model surpasses both leading open-weight and proprietary translation systems on the academic domain.

📝 Abstract
We present ACADATA, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions, and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models on the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.
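The fine-tuning setup described above turns parallel paragraph pairs into supervised examples. A minimal sketch of that data preparation follows; the JSON field names (`prompt`, `completion`), the prompt template, and the file name are assumptions for illustration, not ACAD-TRAIN's actual schema.

```python
import json

def to_finetune_record(src_paragraph, tgt_paragraph, src_lang, tgt_lang):
    """Wrap one parallel paragraph pair as a prompt/completion example.

    The prompt template and field names here are hypothetical; the
    paper's actual fine-tuning format may differ.
    """
    prompt = (f"Translate the following academic paragraph "
              f"from {src_lang} into {tgt_lang}:\n\n{src_paragraph}")
    return {"prompt": prompt, "completion": tgt_paragraph}

# Write a toy training file in JSONL, one example per line.
pairs = [
    ("La atención es todo lo que necesitas.",
     "Attention is all you need.", "Spanish", "English"),
]
with open("acad_train_sample.jsonl", "w", encoding="utf-8") as f:
    for src, tgt, sl, tl in pairs:
        f.write(json.dumps(to_finetune_record(src, tgt, sl, tl),
                           ensure_ascii=False) + "\n")
```

Paragraph-level (rather than sentence-level) pairs are what let a model learn cross-sentence phenomena such as terminology consistency and pronoun resolution, which matter for the long-context gains reported above.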
Problem

Research questions and friction points this paper is trying to address.

Creating a parallel dataset for academic machine translation
Evaluating fine-tuned LLMs against specialized translation systems
Improving academic and long-context translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created parallel academic dataset for machine translation
Fine-tuned LLMs on academic dataset for translation
Achieved superior performance over proprietary translation models
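The d-BLEU figures cited above are document-level BLEU: hypothesis and reference sentences are concatenated per document before n-gram counting, so the score also rewards cross-sentence consistency. A simplified sketch (no smoothing, unlike standard toolkits such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def d_bleu(hyp_sents, ref_sents, max_n=4):
    """Document-level BLEU: join all sentences of the document into one
    token stream, then compute clipped n-gram precisions and the
    brevity penalty over the whole document."""
    hyp = " ".join(hyp_sents).split()
    ref = " ".join(ref_sents).split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped matches
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = math.exp(min(0.0, 1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return 100 * bp * math.exp(log_avg)

# A perfect document-level match scores 100.
print(d_bleu(["the model translates", "the paragraph"],
             ["the model translates", "the paragraph"]))  # → 100.0
```

Because n-grams are counted over the concatenated document, a translation that renders each sentence adequately but uses inconsistent terminology across sentences scores lower than one that stays consistent, which is the behavior academic translation needs.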
Authors

Iñaki Lacunza, Barcelona Supercomputing Center (BSC)
Javier Garcia Gilabert, Barcelona Supercomputing Center (BSC)
Francesca De Luca Fornaciari, Barcelona Supercomputing Center (BSC)
Javier Aula-Blasco, Barcelona Supercomputing Center (BSC)
Aitor Gonzalez-Agirre, Barcelona Supercomputing Center (BSC)
Maite Melero, Senior Researcher, Barcelona Supercomputing Center (BSC)
Marta Villegas, Barcelona Supercomputing Center (BSC)