🤖 AI Summary
This work proposes Magis-Bench, a benchmark for evaluating the judicial reasoning capabilities of large language models (LLMs). It addresses a gap in existing legal AI benchmarks, which focus predominantly on legal argumentation or document generation and do not systematically assess judicial judgment skills such as weighing claims, applying legal norms, and rendering decisions. Built on 74 questions from eight Brazilian competitive examinations for judicial positions (2023–2025), Magis-Bench covers multi-turn discursive legal analysis and the drafting of complete civil and criminal judgments. Under an LLM-as-a-judge methodology with four frontier models as independent evaluators, inter-judge agreement is strong (Kendall’s W = 0.984). Among the 23 models evaluated, Gemini-3-Pro-Preview attains the highest average score (6.97/10), yet every model scores below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. The dataset, model outputs, and evaluation code are publicly released.
📝 Abstract
Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to \emph{judge} such arguments -- weighing competing claims, applying doctrine to facts, and rendering reasoned decisions -- is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's $W = 0.984$; pairwise Kendall's $\tau \ge 0.897$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70\% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.
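
For readers unfamiliar with the agreement statistics reported above, the following is a minimal sketch of how Kendall's $W$ and a pairwise Kendall's $\tau$ could be computed from a judges-by-models score table. It assumes that each judge's per-model average scores are converted to ranks before comparison, which the abstract does not specify. It uses the textbook $W = 12S / (m^2 (n^3 - n))$ without a tie correction and the simple $\tau$-a variant. The function names and the toy scores are hypothetical, chosen for illustration only; they are not taken from the released evaluation code or results.

```python
from itertools import combinations


def average_ranks(scores):
    """Rank scores ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(score_matrix):
    """Kendall's W for score_matrix[judge][model]; no tie correction applied."""
    m = len(score_matrix)        # number of judges
    n = len(score_matrix[0])     # number of models being ranked
    rank_rows = [average_ranks(row) for row in score_matrix]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def kendall_tau_a(x, y):
    """Kendall's tau-a between two judges' score lists (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: 3 judges scoring 4 models (NOT the paper's results).
toy_scores = [
    [6.9, 6.6, 6.4, 5.0],
    [7.1, 6.8, 6.3, 4.8],
    [6.8, 6.7, 6.5, 5.2],
]
w = kendalls_w(toy_scores)
tau_01 = kendall_tau_a(toy_scores[0], toy_scores[1])
print(f"W = {w:.3f}, tau(judge 0, judge 1) = {tau_01:.3f}")
```

$W$ ranges from 0 (no agreement among judges) to 1 (identical rankings), so a value such as the reported 0.984 means the four judges order the 23 models almost identically, even if their absolute scores differ.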