🤖 AI Summary
This study addresses the instability of large language model rankings on authentic Brazilian Portuguese dialogues, which stems from the bias of the judge model under holistic LLM-as-a-judge scoring. To mitigate this issue, the authors propose a fine-grained evaluation framework based on binary rubric scoring with multi-judge filtering. The approach substantially improves ranking consistency: three judges agree on all 16 model ranks, compared with only 7 of 16 under holistic scoring, and the filtering pipeline increases the average score gap between adjacent models by 47%. The work introduces Prosa, the first real-user, multi-turn Brazilian Portuguese chat benchmark, built from 1,000 WildChat conversations and scored by three judges from three model families; evaluating a new model costs approximately $2.1 with Gemini 3 Flash as the judge. Both the benchmark and the filtering code are publicly released. The results indicate that decomposing the judgement matters more for ranking consistency than the choice of judge model.
📝 Abstract
Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real-user, multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations on which 16 models are scored by three judges from three model families. Under filtered rubric scoring, the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
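The two headline metrics, rank agreement across judges and the average score gap between neighbouring models, can be illustrated with a minimal sketch. The abstract does not give formal definitions, so the ones below are assumptions: a rank counts as agreed when every judge places the same model at that position, and the gap is the mean difference between consecutive scores in the sorted ranking. The function names, model names, and scores are hypothetical toy values, not Prosa data or the released filtering code.

```python
from statistics import mean


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda model: scores[model], reverse=True)


def agreed_ranks(per_judge_scores):
    """Count rank positions at which every judge places the same model.

    Assumed definition; the abstract does not spell out how agreement
    on a rank is measured.
    """
    rankings = [ranking(scores) for scores in per_judge_scores]
    return sum(1 for models in zip(*rankings) if len(set(models)) == 1)


def mean_adjacent_gap(scores):
    """Average score difference between neighbouring models in the ranking."""
    ordered = sorted(scores.values(), reverse=True)
    return mean(high - low for high, low in zip(ordered, ordered[1:]))


# Hypothetical toy scores from three judges for three models.
judges = [
    {"model_a": 0.90, "model_b": 0.70, "model_c": 0.40},
    {"model_a": 0.80, "model_b": 0.60, "model_c": 0.50},
    {"model_a": 0.70, "model_b": 0.75, "model_c": 0.30},
]

n_agreed = agreed_ranks(judges)
gap_first_judge = mean_adjacent_gap(judges[0])
print(f"ranks agreed by all judges: {n_agreed} of {len(judges[0])}")
print(f"mean adjacent gap (first judge): {gap_first_judge:.2f}")
```

In this toy case the third judge swaps the top two models, so only the last rank is agreed by all three judges. A larger mean adjacent gap means neighbouring models are easier to tell apart, which is the sense in which the abstract speaks of discriminative power.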