TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not adequately measure how large language models (LLMs) perform in day-to-day judicial work or what risks they pose when deployed there. This work introduces TriBench-Ko, a Korean benchmark built from real judicial decisions and covering four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. Each item is designed around verified judicial task requirements and pairs task performance with one specific risk type, so that model behavior can be assessed jointly across inaccuracy, bias, inconsistency, and adjudicative overreach. An evaluation of a range of contemporary LLMs finds that many of them frequently exhibit significant risks, most notably in precedent retrieval and in capturing critical legal information, which indicates that LLM outputs in judicial settings need rigorous inspection.

📝 Abstract

Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko
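The pairing of one task with one risk type per item can be pictured with a small data-structure sketch. The task and risk names below are taken from the abstract; everything else (the `BenchmarkItem` fields, `risk_category`, `modal_agreement`, and treating adjudicative overreach as a single-type category) is a hypothetical illustration and not the schema or scoring used in the released dataset.

```python
from collections import Counter
from dataclasses import dataclass

# The four core tasks named in the abstract.
TASKS = (
    "jurisprudence summarization",
    "precedent retrieval",
    "legal issue extraction",
    "evidence analysis",
)

# Risk categories and risk types as listed in the abstract.
# Assumption: adjudicative overreach is modeled as its own single-type category.
RISK_TAXONOMY = {
    "inaccuracy": ("hallucination", "omission", "statutory misapplication"),
    "bias": ("demographic", "overcompliance"),
    "inconsistency": ("prompt sensitivity", "non-determinism"),
    "adjudicative overreach": ("adjudicative overreach",),
}


def risk_category(risk_type):
    """Return the category a risk type belongs to, or None if unknown."""
    for category, risk_types in RISK_TAXONOMY.items():
        if risk_type in risk_types:
            return category
    return None


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical item record: one task paired with one risk type."""
    item_id: str
    task: str
    risk_type: str

    def __post_init__(self):
        # Reject items whose task or risk type is outside the taxonomy.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if risk_category(self.risk_type) is None:
            raise ValueError(f"unknown risk type: {self.risk_type}")


def modal_agreement(outputs):
    """Share of outputs equal to the most common one (1.0 = fully consistent).

    An illustrative stand-in for a non-determinism check over repeated runs;
    not the paper's metric.
    """
    if not outputs:
        raise ValueError("outputs must not be empty")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)
```

For example, an item built as `BenchmarkItem("demo-1", "precedent retrieval", "omission")` falls under the inaccuracy category, and four repeated runs that yield the same answer three times give a modal agreement of 0.75.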

Problem

Research questions and friction points this paper is trying to address.

LLM risks

judicial workflows

legal benchmark

deployment risks

hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

judicial benchmark

LLM risk evaluation

legal NLP