🤖 AI Summary
This study investigates whether performance gains in large language model (LLM) judges stem from atomic decomposition itself or merely from the richer prompts that typically accompany it. By systematically comparing self-decomposing atomic judges against end-to-end holistic judges on reference-support classification tasks—while controlling for prompt complexity—it provides a rigorous, controlled evaluation of atomic decomposition's efficacy. The experiments employ paired source-level testing, cluster-based bootstrapping, validation across four model families, and three pre-frozen prompt variants per design family, evaluated on TruthfulQA, ASQA, and QAMPARI. Results show that holistic judges outperform atomic judges on ASQA and QAMPARI (statistically reliable in three of four model families), particularly in identifying partially supported answers, while atomic judges exhibit only a marginal advantage on TruthfulQA. These findings are corroborated by human annotations.
📝 Abstract
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases -- incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
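The source-level paired comparison with a cluster bootstrap can be sketched as follows. This is a minimal illustration, not the paper's code: the data, the per-source correctness indicators, and the function names are all hypothetical, and it assumes one paired correctness value per source example (the cluster unit), resampled whole to respect within-source correlation.

```python
import random

# Hypothetical paired outcomes per source example (the cluster unit):
# 1 = judge's label matched the gold label, 0 = it did not.
# Numbers are illustrative only.
paired = [
    {"holistic": h, "atomic": a}
    for h, a in [(1, 1), (1, 0), (0, 0), (1, 0), (1, 1), (0, 1), (1, 0), (1, 1)]
]

def mean_diff(sample):
    """Mean paired accuracy difference (holistic - atomic) over sources."""
    return sum(d["holistic"] - d["atomic"] for d in sample) / len(sample)

def cluster_bootstrap_ci(sample, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for the paired difference, resampling whole sources
    with replacement so each draw keeps both judges' outcomes together."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in sample]
        diffs.append(mean_diff(resample))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

point = mean_diff(paired)
lo, hi = cluster_bootstrap_ci(paired)
# The difference counts as statistically reliable when the interval
# excludes zero; here the toy sample is far too small for that.
```

Resampling at the source level (rather than per judged candidate) is what makes the interval valid when several candidates share a source and their errors are correlated.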