CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of a benchmark for instruction-following evaluation in Brazilian Portuguese that integrates literary context and supports automatic assessment. We present the first instruction-following dataset grounded in eight canonical Brazilian literary works, encompassing 59 automatically verifiable instruction types that emphasize linguistic specificity (such as morphological constraints on suffixes like -ando, -inho, and -mente) and structured formatting requirements. To enable scalable evaluation, we introduce an automatic validation mechanism that requires neither human annotation nor large language model (LLM) judging, along with a multi-turn dialogue framework for assessing constraint adherence across conversational turns. Experiments show that the state-of-the-art model GPT-5.2 achieves 98.5% strict accuracy, while the cost-efficient Portuguese-specialized model Sabiazinho-4 attains 87.0%; notably, models vary substantially in maintaining constraints over multi-turn interactions, with conversation-level performance ranging from 60% to 96%.
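The automatic validation mechanism described above can be illustrated with a small sketch. The function names and the rule format below are assumptions for illustration only, not CAPITU's actual API; they show how a morphological constraint such as "use at least N words ending in -mente" can be checked without an LLM judge or human annotation.

```python
import re

def count_words_with_suffix(text: str, suffixes: tuple[str, ...]) -> int:
    """Count words in `text` that end with any of the given suffixes."""
    words = re.findall(r"\w+", text)  # \w matches Unicode letters in Python 3
    return sum(1 for w in words if w.lower().endswith(suffixes))

def verify_min_suffix_count(text: str, suffixes: tuple[str, ...],
                            minimum: int) -> bool:
    """Pass iff the response contains at least `minimum` matching words."""
    return count_words_with_suffix(text, suffixes) >= minimum

# Illustrative model response (not taken from the benchmark):
resposta = "Capitu caminhava lentamente, olhando o mar e sorrindo docemente."
print(verify_min_suffix_count(resposta, ("mente",), 2))  # True
```

Because the check is a deterministic string operation, it scales to any number of model outputs at negligible cost, which is what makes the 59 instruction types automatically verifiable.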

📝 Abstract
We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
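The conversation-level accuracy reported in the abstract can be sketched as follows. This is a minimal illustration assuming a strict scoring rule in which a conversation counts as correct only if the constraint holds on every turn; the metric name and scoring details are assumptions, since the page does not reproduce the paper's exact definitions.

```python
def conversation_level_accuracy(conversations: list[list[bool]]) -> float:
    """Fraction of conversations in which the constraint held on ALL turns.

    Each inner list holds one boolean per turn: True if that turn's
    response satisfied the instruction constraint.
    """
    passed = sum(1 for turns in conversations if all(turns))
    return passed / len(conversations)

# Illustrative per-turn verification results for four conversations:
results = [
    [True, True, True],   # constraint maintained throughout
    [True, False, True],  # constraint dropped on turn 2 -> conversation fails
    [True, True],
    [False],
]
print(conversation_level_accuracy(results))  # 0.5
```

Under this all-turns rule, a single lapse fails the whole conversation, which explains why conversation-level accuracy (60% to 96% across models) can sit well below single-turn accuracy.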
Problem

Research questions and friction points this paper is trying to address.

instruction-following
Brazilian Portuguese
large language models
literary context
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-following
Brazilian Portuguese
literary context
automatically verifiable benchmark
morphological constraints