Salamandra Technical Report

📅 2025-02-12
🤖 AI Summary
Problem: There is a lack of fully open, reproducible multilingual large language models (LLMs) supporting robust text understanding, generation, and code capabilities across diverse languages. Method: We trained the Salamandra series of decoder-only LLMs (2B, 7B, and 40B parameters) from scratch on a custom-built, openly licensed multilingual corpus encompassing 35 European languages and programming languages, with mixed text–code data; models support supervised instruction tuning and preliminary multimodal adaptation. Contribution/Results: This work presents the first end-to-end open, reproducible training pipeline for multilingual LLMs at full scale. We release all models under the Apache 2.0 license, along with training and evaluation scripts and comprehensive benchmarking results—including multilingual, fairness, safety, and robustness evaluations. Empirical results demonstrate competitive performance against leading open-source models of comparable size, significantly advancing open science and trustworthy AI ecosystems.

📝 Abstract
This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
Problem

Research questions and friction points this paper is trying to address.

Develop open-source multilingual large language models
Evaluate models on multilingual benchmarks and safety
Promote open science with accessible training scripts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source multilingual large language models
Checkpoints fine-tuned on public-domain instruction data
Preliminary multimodal experiments for potential applications
Aitor Gonzalez-Agirre
Barcelona Supercomputing Center (BSC)
Artificial Intelligence, Natural Language Processing, Semantics, Deep Learning

Marc Pàmies
Barcelona Supercomputing Center

Joan Llop
Barcelona Supercomputing Center

Irene Baucells
Barcelona Supercomputing Center
NLP

Severino Da Dalt
Barcelona Supercomputing Center

Daniel Tamayo
Harvey Mudd College
Orbital Dynamics, Planetary Science, Chaos

J. Saiz
Barcelona Supercomputing Center

Ferran Espuña
Barcelona Supercomputing Center

Jaume Prats
Barcelona Supercomputing Center

Javier Aula-Blasco
Barcelona Supercomputing Center

Mario Mina
Barcelona Supercomputing Center

Adrián Rubio
Barcelona Supercomputing Center

Alexander Shvets
Universitat Pompeu Fabra
information extraction, computational lexicography, text generation, hate speech analysis

Anna Sallés
Barcelona Supercomputing Center

Iñaki Lacunza
Barcelona Supercomputing Center

Iñigo Pikabea
Barcelona Supercomputing Center

Jorge Palomar
Barcelona Supercomputing Center

Júlia Falcão
Barcelona Supercomputing Center (BSC)
NLP, AI ethics, bias, LLM evaluation

Lucía Tormo
Barcelona Supercomputing Center

Luis Vasquez-Reina
Barcelona Supercomputing Center

Montserrat Marimon
Universitat Pompeu Fabra
Computational Linguistics, Natural Language Processing

Valle Ruíz-Fernández
Barcelona Supercomputing Center

Marta Villegas
Barcelona Supercomputing Center
Natural Language Processing