Tucano 2 Cool: Better Open Source LLMs for Portuguese

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses significant limitations in existing open-source large language models (LLMs) for Portuguese, particularly in data quality and scale, as well as in capabilities such as reasoning, tool use, and code generation. To close these gaps, the authors introduce the Tucano 2 series of open-source LLMs (0.5–3.7B parameters), which they present as the first comprehensive Portuguese LLM framework encompassing foundational knowledge, instruction following, and chain-of-thought reasoning. The models are built on an expanded high-quality corpus (GigaVerbo-v2), augmented with synthetic and instruction-tuning data, and trained via a pipeline integrating pretraining, continued pretraining, supervised fine-tuning, and preference alignment. The models achieve state-of-the-art performance across multiple Portuguese benchmarks, and the project fully releases its training recipes, logs, and code, along with a dedicated evaluation suite for end-to-end model assessment.
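As a rough illustration of the first stage of such a pipeline, the sketch below shows a minimal causal-language-modeling pretraining loop with Hugging Face Transformers. The dataset repository ID and the "text" column name are assumptions (the corpus name comes from the paper, but its hosting location is not stated here), and a small stand-in model replaces the actual Tucano 2 architecture; the later stages (continued pretraining, supervised fine-tuning, preference alignment) would follow analogously with the post-training datasets.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed repository ID and column name, for illustration only;
# consult the project's actual release for the real GigaVerbo-v2 location.
dataset = load_dataset("TucanoBR/GigaVerbo-v2", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in tokenizer
tokenizer.pad_token = tokenizer.eos_token              # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in small model

def tokenize(batch):
    # Assumes the corpus exposes raw documents in a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tucano2-pretrain-sketch",
        max_steps=1_000,                 # streaming data has no fixed length
        per_device_train_batch_size=4,
        learning_rate=3e-4,
    ),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()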

📝 Abstract
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5–3.7 billion parameters, designed to address key gaps in the open-source development of Portuguese LLMs. Following our previous work, we extend our dataset into GigaVerbo-v2, reaching a new level of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, which allow Portuguese LLMs to be trained for retrieval-augmented generation, coding, tool use, chain-of-thought reasoning, and other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese language-modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.
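Since the suite is released openly, a checkpoint should be loadable with standard tooling. Below is a minimal inference sketch using Hugging Face Transformers; the repository ID TucanoBR/Tucano2-1b-Instruct is hypothetical, used only to illustrate the pattern, and the Portuguese prompt translates to "Explain what a language model is."

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID; check the project's release page for the
# actual Tucano 2 Instruct checkpoint name.
repo_id = "TucanoBR/Tucano2-1b-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Portuguese prompt: "Explain what a language model is."
messages = [{"role": "user",
             "content": "Explique o que é um modelo de linguagem."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=128,
                         do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:],
                       skip_special_tokens=True))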
Problem

Research questions and friction points this paper is trying to address.

Portuguese LLMs
open-source models
training data gaps
multitask language modeling
reproducible NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Portuguese LLMs
synthetic data
continual pretraining
instruction tuning
open-source models
Nicholas Kluge Corrêa
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence; Center for Science and Thought
Aniket Sen
Helmholtz-Institut für Strahlen- und Kernphysik
Shiza Fatimah
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence
Sophia Falk
Bonn Sustainable AI Lab
Lennard Landgraf
Center for Science and Thought
Julia Kastner
Center for Science and Thought
Lucie Flek
University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Natural Language Processing
Machine Learning
Physics
Computational Social Sciences