Tucano 2 Cool: Better Open Source LLMs for Portuguese

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses significant limitations in existing open-source large language models (LLMs) for Portuguese, particularly in data quality and scale, as well as in capabilities such as reasoning, tool use, and code generation. To close these gaps, the authors introduce the Tucano 2 series of open-source LLMs (0.5–3.7B parameters), which they present as the first comprehensive Portuguese LLM framework encompassing foundational knowledge, instruction following, and chain-of-thought reasoning. The models are built on an expanded high-quality corpus (GigaVerbo-v2), augmented with synthetic and instruction-tuning data, and trained via a pipeline integrating pretraining, continued pretraining, supervised fine-tuning, and preference alignment. The models achieve state-of-the-art performance across multiple Portuguese benchmarks, and the project fully releases its training recipes, logs, and code, along with a dedicated evaluation suite for end-to-end model assessment.
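As a rough illustration of the first stage of such a pipeline, the sketch below shows a minimal causal-language-modeling pretraining loop with Hugging Face Transformers. The dataset repository ID and the "text" column name are assumptions (the corpus name comes from the paper, but its hosting location is not stated here), and a small stand-in model replaces the actual Tucano 2 architecture; the later stages (continued pretraining, supervised fine-tuning, preference alignment) would follow analogously with the post-training datasets.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed repository ID and column name, for illustration only;
# consult the project's actual release for the real GigaVerbo-v2 location.
dataset = load_dataset("TucanoBR/GigaVerbo-v2", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in tokenizer
tokenizer.pad_token = tokenizer.eos_token              # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in small model

def tokenize(batch):
    # Assumes the corpus exposes raw documents in a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tucano2-pretrain-sketch",
        max_steps=1_000,                 # streaming data has no fixed length
        per_device_train_batch_size=4,
        learning_rate=3e-4,
    ),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()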

📝 Abstract
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5–3.7 billion parameters, designed to address key gaps in the open-source development of Portuguese LLMs. Following our previous work, we extend our dataset into GigaVerbo-v2, reaching a new level of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, which allow Portuguese LLMs to be trained for retrieval-augmented generation, coding, tool use, chain-of-thought reasoning, and other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese language-modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.
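Since the suite is released openly, a checkpoint should be loadable with standard tooling. Below is a minimal inference sketch using Hugging Face Transformers; the repository ID TucanoBR/Tucano2-1b-Instruct is hypothetical, used only to illustrate the pattern, and the Portuguese prompt translates to "Explain what a language model is."

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID; check the project's release page for the
# actual Tucano 2 Instruct checkpoint name.
repo_id = "TucanoBR/Tucano2-1b-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Portuguese prompt: "Explain what a language model is."
messages = [{"role": "user",
             "content": "Explique o que é um modelo de linguagem."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=128,
                         do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:],
                       skip_special_tokens=True))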
Problem

Research questions and friction points this paper is trying to address.

Portuguese LLMs
open-source models
training data gaps
multitask language modeling
reproducible NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Portuguese LLMs
synthetic data
continual pretraining
instruction tuning
open-source models
Nicholas Kluge Corrêa
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence; Center for Science and Thought
Aniket Sen
Helmholtz-Institut für Strahlen- und Kernphysik
Shiza Fatimah
Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence
Sophia Falk
Bonn Sustainable AI Lab
Lennard Landgraf
Center for Science and Thought
Julia Kastner
Center for Science and Thought
Lucie Flek
University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Natural Language Processing
Machine Learning
Physics
Computational Social Sciences