CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

📅 2026-04-14

🤖 AI Summary

Large language models (LLMs) used as judges have not been shown to reliably detect complex, domain-specific compliance violations in dialogue systems, and the field lacks both a systematic evaluation benchmark and high-quality annotated data. This work proposes CompliBench, a novel benchmark for conversational compliance detection that combines automated multi-turn dialogue generation, controllable violation injection, and adversarial search to construct challenging, precisely labeled non-compliant samples. Evaluation on the benchmark shows that leading proprietary LLMs struggle significantly with this task. In contrast, a small judge model fine-tuned on the synthesized data outperforms these LLMs and generalizes well to unseen business domains.

📝 Abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
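
To make the labeling idea in the abstract concrete, the following is a minimal sketch, not the authors' pipeline, of how injecting a flaw into a known agent turn yields a ground-truth label (violated guideline plus turn index) for free, and how a judge's prediction could then be scored for detection and localization. Every name here (`Turn`, `Label`, `inject_violation`, `score_prediction`, the guideline id `G-verify-identity`) is hypothetical, and the violating reply is hand-written rather than generated or adversarially searched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "agent"
    text: str


@dataclass(frozen=True)
class Label:
    # Both fields are None when the dialogue is compliant (hypothetical convention).
    guideline_id: Optional[str]
    turn_index: Optional[int]


def inject_violation(dialogue, turn_index, guideline_id, violating_text):
    """Swap one agent turn for a violating reply; return (new dialogue, gold label)."""
    if dialogue[turn_index].speaker != "agent":
        raise ValueError("violations can only be injected into agent turns")
    perturbed = list(dialogue)  # copy so the compliant dialogue is left untouched
    perturbed[turn_index] = replace(dialogue[turn_index], text=violating_text)
    # The label is known by construction: we chose the guideline and the turn.
    return perturbed, Label(guideline_id, turn_index)


def score_prediction(gold, predicted):
    """Compare a judge's prediction to the gold label on three criteria."""
    detected = (gold.guideline_id is None) == (predicted.guideline_id is None)
    return {
        "detection": detected,
        "guideline": detected and gold.guideline_id == predicted.guideline_id,
        "turn": detected and gold.turn_index == predicted.turn_index,
    }


dialogue = [
    Turn("user", "I want a refund for my order."),
    Turn("agent", "Sure, let me verify your identity first."),
    Turn("user", "My order number is 1234."),
    Turn("agent", "Thanks, your refund is being processed."),
]

# Assumed guideline: the agent must verify identity before issuing a refund.
perturbed, gold = inject_violation(
    dialogue, 1, "G-verify-identity", "Sure, I have issued the refund."
)

# A judge that names the right guideline but blames the wrong turn.
result = score_prediction(gold, Label("G-verify-identity", 3))
print(result)
```

In this sketch the judge is credited for detecting the violation and naming the guideline but not for localizing the turn, which illustrates why separate detection and localization scores can be informative.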

Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge

compliance violation detection

dialogue systems

benchmark

policy adherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

CompliBench

LLM-as-a-Judge

automated data generation