🤖 AI Summary
Evaluations of large language model (LLM) judges for multi-constraint instruction following typically rely on holistic, whole-response judgments, which makes it difficult to assess how accurately a judge verifies each individual constraint. This work proposes MCJudgeBench, the first constraint-level evaluation framework for such tasks, which systematically measures judge correctness and consistency using explicit constraint lists, per-constraint ternary labels (yes/partial/no), controlled response perturbations, and evaluation prompt variants, including ones that elicit chain-of-thought reasoning. The framework further distinguishes intrinsic inconsistency (variation under stochastic decoding) from procedural inconsistency (variation under prompt and response perturbations). Experimental results show that strong overall performance does not guarantee reliable detection of the rarer labels (partial and no), that high correctness does not necessarily imply low inconsistency, and that chain-of-thought reasoning can improve correctness without consistently improving stability.
📝 Abstract
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
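To make the distinction between overall correctness, per-label correctness, and inconsistency concrete, the following is a minimal Python sketch. It is not the benchmark's actual schema or metric definitions, which the abstract does not specify: the `Instance` fields, the function names, and the choice of per-label recall and a simple disagreement rate are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("yes", "partial", "no")


@dataclass
class Instance:
    """Hypothetical layout of one benchmark instance (field names assumed)."""
    instruction: str
    response: str
    constraints: list[str]
    gold: list[str]  # one label from LABELS per constraint


def accuracy(gold, pred):
    """Overall fraction of constraints whose predicted label matches gold."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_recall(gold, pred):
    """Recall for each gold label that occurs, over constraint-level labels."""
    hits, totals = Counter(), Counter()
    for g, p in zip(gold, pred):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    return {lab: hits[lab] / totals[lab] for lab in LABELS if totals[lab]}


def disagreement_rate(runs):
    """Fraction of constraints that receive more than one label across runs.

    `runs` is a list of label lists, one per judge run, all the same length.
    Runs that differ only by sampling would approximate intrinsic
    inconsistency; runs that differ by prompt variant or response
    perturbation would approximate procedural inconsistency (assumed proxy).
    """
    n = len(runs[0])
    return sum(len({run[i] for run in runs}) > 1 for i in range(n)) / n


# Toy labels (invented for illustration, not benchmark data).
gold = ["yes", "yes", "yes", "partial", "no"]
pred = ["yes", "yes", "yes", "yes", "no"]

print(accuracy(gold, pred))          # 0.8
print(per_label_recall(gold, pred))  # {'yes': 1.0, 'partial': 0.0, 'no': 1.0}

# Same prompt, three sampled runs: only the fourth constraint flips.
sampled = [pred, ["yes", "yes", "yes", "partial", "no"], pred]
print(disagreement_rate(sampled))    # 0.2
```

In this toy case the judge reaches 0.8 overall accuracy while never detecting the single partial case, which is the kind of gap a whole-response score would hide.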