Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the implicit biases and limited interpretability of LLM-as-a-Judge systems used in place of human evaluation. We propose the dual-algorithm framework of CLoVE and GloVE, the first to systematically distill locally contrastive explanations into verifiable, globally consistent evaluation strategies. The method integrates concept-driven contrastive explanation generation, iterative clustering, abstractive summarization, and formal verification, enabling concept-level, high-fidelity, and reproducible strategy extraction. Evaluated across seven content-harm detection benchmarks, the extracted strategies prove robust and highly faithful (average F1 ≥ 0.92). A user study confirms significant improvements in evaluators' comprehension (+41.3%) and trust and satisfaction (+38.7%). The core contribution is an analytical paradigm for LLM judges that jointly ensures interpretability, formal verifiability, and generalizable strategy learning.

📝 Abstract
Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.
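The GloVE pipeline described above (condensing local rules into a global policy via clustering, summarization, and verification) can be sketched with a toy implementation. This is a minimal illustration under assumed data structures, not the authors' code: rules are represented as hypothetical dicts of concepts and verdicts, clustering is a simple group-by, and verification checks candidates against the judge's decisions.

```python
from collections import defaultdict

def cluster_rules(rules, key):
    """Group local rules by a shared attribute (a stand-in for iterative clustering)."""
    clusters = defaultdict(list)
    for rule in rules:
        clusters[key(rule)].append(rule)
    return clusters

def summarize(cluster):
    """Condense a cluster of local rules into one candidate global rule
    (a stand-in for abstractive summarization)."""
    concepts = sorted({c for rule in cluster for c in rule["concepts"]})
    return {"concepts": concepts, "verdict": cluster[0]["verdict"]}

def verify(candidate, examples, judge):
    """Keep a candidate only if it reproduces the judge's verdict on every
    example its concepts cover (a stand-in for formal verification)."""
    return all(judge(ex) == candidate["verdict"]
               for ex in examples
               if set(candidate["concepts"]) <= set(ex["concepts"]))

def extract_global_policy(rules, examples, judge):
    """Cluster local rules, summarize each cluster, and retain verified candidates."""
    policy = []
    for cluster in cluster_rules(rules, key=lambda r: r["verdict"]).values():
        candidate = summarize(cluster)
        if verify(candidate, examples, judge):
            policy.append(candidate)
    return policy
```

For instance, local rules citing "threat" and "insult" for a "harmful" verdict would be merged into one candidate rule, which survives only if the judge agrees on all covered examples.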
Problem

Research questions and friction points this paper is trying to address.

Understanding biases in LLM-based text evaluation systems
Extracting verifiable concept-based policies from LLM judges
Analyzing robustness of LLM judge policies to attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts concept-based global policies from LLM-as-a-Judge
Generates verifiable, contrastive local explanations with CLoVE
Condenses local rules into a global policy via clustering, summarization, and verification with GloVE
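The contrastive aspect of CLoVE can be illustrated with a toy verifiability check: a local explanation cites concepts whose removal flips the judge's verdict. The data representation and function below are illustrative assumptions, not the paper's implementation.

```python
def contrastive_explanation(example, cited_concepts, judge):
    """Return a local contrastive explanation if removing the cited concepts
    flips the judge's verdict; otherwise the explanation fails verification."""
    original = judge(example)  # verdict on the unmodified input
    counterfactual = {"concepts": [c for c in example["concepts"]
                                   if c not in cited_concepts]}
    if judge(counterfactual) != original:
        return {"concepts": cited_concepts, "verdict": original}
    return None  # cited concepts do not explain the verdict

# Toy judge: flags any text containing the "threat" concept as harmful.
judge = lambda ex: "harmful" if "threat" in ex["concepts"] else "safe"
exp = contrastive_explanation({"concepts": ["threat", "slang"]}, ["threat"], judge)
```

Here the explanation citing "threat" is verified, because dropping that concept changes the verdict from "harmful" to "safe", while citing an irrelevant concept like "slang" would fail the check.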
👥 Authors
Jasmina Gajcin
IBM Research Ireland
Erik Miehling
IBM Research (control, reinforcement learning, game theory, artificial intelligence)
Rahul Nair
IBM Research Ireland
Elizabeth Daly
IBM Research Ireland
Radu Marinescu
IBM Research (artificial intelligence)
Seshu Tirupathi
IBM Research Ireland