🤖 AI Summary
Current large language models (LLMs) are not reliable at detecting complex, domain-specific compliance violations in dialogue systems, and the field lacks both systematic evaluation benchmarks and high-quality annotated data. This work proposes CompliBench, the first systematic benchmark for conversational compliance detection. It combines automated multi-turn dialogue generation, controllable violation injection, and adversarial search to construct challenging, precisely labeled non-compliant samples. Evaluations on the benchmark show that leading closed-source LLMs perform poorly on this task. In contrast, lightweight judge models fine-tuned on the synthetically generated data outperform these LLMs and generalize well across domains.
📝 Abstract
As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
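The labeling idea in the abstract (injecting a flaw at a known turn so that the ground truth comes for free) can be illustrated with a minimal Python sketch. The abstract does not specify any data format or scoring rule, so every name here (`Turn`, `Label`, `inject_violation`, `score_judge`, the guideline ID `G7`) is a hypothetical illustration and not the authors' implementation. In the described pipeline the violating text would presumably be produced by a model and hardened by adversarial search; here it is supplied by hand.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "agent" (assumed role names)
    text: str


@dataclass(frozen=True)
class Label:
    guideline_id: str  # which guideline is violated (hypothetical ID scheme)
    turn_index: int    # zero-based index of the violating turn


def inject_violation(
    dialogue: List[Turn], turn_index: int, guideline_id: str, violating_text: str
) -> Tuple[List[Turn], Label]:
    """Replace one agent turn with a violating response and return the label.

    Because the injection site is chosen by the caller, the ground-truth
    label is known by construction and needs no human annotation.
    """
    if dialogue[turn_index].speaker != "agent":
        raise ValueError("violations can only be injected into agent turns")
    flawed = list(dialogue)  # copy so the compliant dialogue is left intact
    flawed[turn_index] = replace(dialogue[turn_index], text=violating_text)
    return flawed, Label(guideline_id, turn_index)


def score_judge(prediction: Optional[Label], gold: Label) -> dict:
    """Score detection and localization separately (an assumed metric split)."""
    detected = prediction is not None
    return {
        "detected": detected,
        "guideline_correct": detected and prediction.guideline_id == gold.guideline_id,
        "turn_correct": detected and prediction.turn_index == gold.turn_index,
    }


# Toy example: a hypothetical guideline "G7" requires identity verification
# before a refund is issued.
clean = [
    Turn("user", "I'd like a refund for my last order."),
    Turn("agent", "Sure. Can you confirm the email on the account first?"),
    Turn("user", "It's pat@example.com."),
    Turn("agent", "Thanks, the refund has been issued."),
]
flawed, gold = inject_violation(clean, 1, "G7", "Done, your refund has been issued.")
result = score_judge(Label("G7", 3), gold)  # right guideline, wrong turn
```

Separating detection from localization in the score mirrors the abstract's statement that judges are asked both to detect and to localize violations; the exact metrics used in the benchmark are not given here.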