PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of dedicated evaluation frameworks for Portuguese large language models (LLMs) and the limited understanding of cross-lingual performance disparities, this work introduces PoETa v2, a large-scale, multi-task Portuguese benchmark comprising more than 40 natural language understanding, generation, and reasoning tasks. Using PoETa v2, the authors systematically evaluate over 20 open- and closed-source LLMs spanning diverse parameter scales and computational budgets, and analyze Portuguese–English performance gaps on equivalent tasks. The results reveal persistent shortfalls in linguistic adaptation and cultural-context modeling, and show how computational investment and language-specific fine-tuning interact to shape performance gains. PoETa v2 is publicly released to support fair, reproducible, and linguistically grounded evaluation of non-English LLMs.

📝 Abstract
Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark -- a comprehensive suite of over 40 tasks in Portuguese -- we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.
Problem

Research questions and friction points this paper addresses.

How LLM performance varies across Portuguese linguistic and cultural contexts
How to systematically assess over 20 models with a comprehensive Portuguese benchmark suite
How computational investment and language-specific adaptation affect performance in Portuguese
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the PoETa v2 benchmark for Portuguese-language evaluation
Assessed more than 20 models across 40+ Portuguese tasks
Analyzed the impact of computational investment and language-specific adaptation on performance
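The cross-lingual analysis described above boils down to comparing per-task scores in Portuguese against the same tasks in English. A minimal sketch of that comparison is shown below; the task names and scores are illustrative placeholders, not actual PoETa v2 results, and the function name is hypothetical.

```python
# Hypothetical sketch: quantifying a Portuguese-English performance gap
# per task, in the spirit of the paper's cross-lingual analysis.
# All task names and scores are made-up placeholders.

def cross_lingual_gap(pt_scores: dict, en_scores: dict) -> dict:
    """Return per-task (English - Portuguese) score deltas for shared tasks."""
    shared = pt_scores.keys() & en_scores.keys()
    return {task: en_scores[task] - pt_scores[task] for task in shared}

pt = {"nli": 0.71, "qa": 0.64, "summarization": 0.58}
en = {"nli": 0.78, "qa": 0.70, "summarization": 0.61}

gaps = cross_lingual_gap(pt, en)
mean_gap = sum(gaps.values()) / len(gaps)
print(f"mean cross-lingual gap: {mean_gap:.3f}")  # positive = English advantage
```

Averaging the per-task deltas gives a single headline number, while the per-task dictionary preserves where the gap is widest, which is the kind of breakdown a benchmark like this enables.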