🤖 AI Summary
This work addresses the longstanding underrepresentation of European Portuguese (pt-PT) in large language model (LLM) training data and evaluation, where existing, often machine-translated benchmarks fail to capture the variant's linguistic and cultural specificity. To close this gap, the authors present AMALIA, a fully open LLM that prioritizes pt-PT by injecting high-quality native pt-PT corpora during the mid- and post-training stages. They also introduce a multidimensional native evaluation suite for pt-PT, combining translated standard tasks with four new datasets that target pt-PT generation, linguistic competence, and dialectal bias between pt-PT and Brazilian Portuguese (pt-BR). Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially outperforming them on pt-PT-specific evaluations, validating the efficacy of targeted training and native-centric assessment.
📝 Abstract
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.