AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

📅 2026-03-27
🤖 AI Summary
This work addresses the underrepresentation of European Portuguese (pt-PT) in both the training data and the evaluation of large language models (LLMs), where machine-translated benchmarks likely miss the variant's linguistic and cultural nuances. To close this gap, the authors introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid-training and post-training stages. They also release a suite of pt-PT benchmarks that combines translated standard tasks with four new datasets targeting pt-PT generation, linguistic competence, and bias between pt-PT and Brazilian Portuguese (pt-BR). Experiments show that AMALIA matches strong baselines on the translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking.
📝 Abstract
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
Problem

Research questions and friction points this paper is trying to address.

European Portuguese
large language models
underrepresented languages
native evaluation
linguistic nuances
Innovation

Methods, ideas, or system contributions that make the work stand out.

European Portuguese LLM
native language benchmarking
high-quality pt-PT data
language-specific fine-tuning
open-source language model
Afonso Simplício
NOVA School of Science and Technology, NOVA LINCS
Gonçalo Vinagre
NOVA School of Science and Technology, NOVA LINCS
Miguel Moura Ramos
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa
Diogo Tavares
NOVA School of Science and Technology
Rafael Ferreira
PhD Student, Nova School of Science and Technology
Conversational Agents, Machine Learning, Artificial Intelligence
Giuseppe Attanasio
Postdoctoral Researcher, Instituto de Telecomunicações
AI, Fairness, Transparency, Safety
Duarte M. Alves
PhD Student, Instituto Superior Técnico, Lisbon
Natural Language Processing, Machine Learning
Inês Calvo
NOVA School of Science and Technology
Inês Vieira
NOVA School of Science and Technology
Rui Guerra
Professor of Physics, Department of Physics and CEOT, Faculty of Sciences and Technology, Universidade do Algarve
Spectroscopy, Postharvest, Non-invasive methods, Light scattering, Optics
James Furtado
NOVA School of Science and Technology, NOVA LINCS
Beatriz Canaverde
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa
Iago Paulo
NOVA School of Science and Technology, NOVA LINCS
Vasco Ramos
PhD Student, Nova School of Science and Technology
Vision-and-Language, Multimodal, Text-to-Image
Diogo Glória-Silva
4th-year PhD student, School of Science and Technology, NOVA University
procedural plan guidance, vision and language models
Miguel Faria
Unknown affiliation
Marcos Treviso
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa
Daniel Gomes
Fundação para a Ciência e Tecnologia
Pedro Gomes
Fundação para a Ciência e Tecnologia
David Semedo
Universidade NOVA de Lisboa
Vision and Language, Deep Learning for Multimedia, Conversational AI
André Martins
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa
João Magalhães
NOVA School of Science and Technology, NOVA LINCS