Salamandra Technical Report

📅 2025-02-12
🤖 AI Summary
Problem: There is a lack of fully open, reproducible multilingual large language models (LLMs) supporting robust text understanding, generation, and code capabilities across diverse languages. Method: We trained the Salamandra series of decoder-only LLMs (2B, 7B, and 40B parameters) from scratch on a custom-built, openly licensed multilingual corpus encompassing 35 European languages and programming languages, with mixed text–code data; models support supervised instruction tuning and preliminary multimodal adaptation. Contribution/Results: This work presents the first end-to-end open, reproducible training pipeline for multilingual LLMs at full scale. We release all models under the Apache 2.0 license, along with training and evaluation scripts and comprehensive benchmarking results—including multilingual, fairness, safety, and robustness evaluations. Empirical results demonstrate competitive performance against leading open-source models of comparable size, significantly advancing open science and trustworthy AI ecosystems.

📝 Abstract
This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
Problem

Research questions and friction points this paper is trying to address.

Develop open-source multilingual large language models
Evaluate models on multilingual benchmarks and safety
Promote open science with accessible training scripts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source multilingual large language models
Checkpoints fine-tuned on public-domain instruction data
Preliminary multimodal experiments for potential applications
Aitor Gonzalez-Agirre
Barcelona Supercomputing Center (BSC)
Artificial Intelligence, Natural Language Processing, Semantics, Deep Learning

Marc Pàmies
Barcelona Supercomputing Center

Joan Llop
Barcelona Supercomputing Center

Irene Baucells
Barcelona Supercomputing Center
NLP

Severino Da Dalt
Barcelona Supercomputing Center

Daniel Tamayo
Harvey Mudd College
Orbital Dynamics, Planetary Science, Chaos

J. Saiz
Barcelona Supercomputing Center

Ferran Espuña
Barcelona Supercomputing Center

Jaume Prats
Barcelona Supercomputing Center

Javier Aula-Blasco
Barcelona Supercomputing Center

Mario Mina
Barcelona Supercomputing Center

Adrián Rubio
Barcelona Supercomputing Center

Alexander Shvets
Universitat Pompeu Fabra
information extraction, computational lexicography, text generation, hate speech analysis

Anna Sallés
Barcelona Supercomputing Center

Iñaki Lacunza
Barcelona Supercomputing Center

Iñigo Pikabea
Barcelona Supercomputing Center

Jorge Palomar
Barcelona Supercomputing Center

Júlia Falcão
Barcelona Supercomputing Center (BSC)
NLP, AI ethics, bias, LLM evaluation

Lucía Tormo
Barcelona Supercomputing Center

Luis Vasquez-Reina
Barcelona Supercomputing Center

Montserrat Marimon
Universitat Pompeu Fabra
Computational Linguistics, Natural Language Processing

Valle Ruíz-Fernández
Barcelona Supercomputing Center

Marta Villegas
Barcelona Supercomputing Center
Natural Language Processing