🤖 AI Summary
This work addresses the longstanding underrepresentation of European Portuguese (pt-PT) in large language model (LLM) training data and evaluation, where existing, often machine-translated benchmarks fail to capture the variant's linguistic and cultural specificity. To close this gap, the authors present AMALIA, a fully open LLM that prioritizes pt-PT by injecting high-quality native pt-PT corpora during the mid- and post-training stages. They also introduce a multidimensional native evaluation suite for pt-PT, combining translated standard tasks with four new datasets that target pt-PT generation, linguistic competence, and dialectal bias between pt-PT and Brazilian Portuguese (pt-BR). Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially outperforming them on pt-PT-specific evaluations, validating the efficacy of targeted training and native-centric assessment.
📝 Abstract
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.