🤖 AI Summary
This work addresses the inconsistency of large language models used as debate judges, which the authors attribute to holistic judging: collapsing a debate's interaction structure into a single opaque score. The authors propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that recasts evaluation as a ranking over the argument interaction graph. GRASP takes local judgments about interactions between arguments and aggregates them into a global ranking with a convergent attack-defense propagation operator. The approach avoids holistic scoring and separates structural robustness from persuasion, factuality, and rhetorical appeal. The authors report that local interaction judgments are more reproducible than holistic rankings, which lets GRASP produce more consistent global rankings, and that GRASP scores do not correlate with human "convincingness" labels. They interpret this as evidence that GRASP measures structural sufficiency, a defense-aware notion of argument robustness, rather than persuasiveness.
📝 Abstract
Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging, a common LLM-as-a-Judge practice where a model provides a global verdict on a debate, suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
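The abstract does not define the propagation operator, so the sketch below is only an illustration of the general idea of turning local attack judgments into a global ranking by fixed-point iteration. It is not GRASP's operator: it uses the h-categorizer rule from the gradual argumentation literature, models attacks only (support relations are omitted), and the names `propagate` and `rank` are hypothetical. Defense shows up implicitly, because an attacker that is itself attacked carries less weight.

```python
def propagate(arguments, attacks, tol=1e-9, max_iter=1000):
    """Iterate an h-categorizer-style update until scores stop changing.

    ASSUMPTION: this stands in for the paper's operator, which the
    abstract does not specify. `attacks` is a list of (attacker, target)
    pairs, i.e. the local interaction judgments.
    """
    attackers = {a: [] for a in arguments}
    for src, dst in attacks:
        attackers[dst].append(src)

    # Every argument starts at full strength.
    scores = {a: 1.0 for a in arguments}
    for _ in range(max_iter):
        # An argument is weakened in proportion to its attackers' strength.
        new_scores = {
            a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a]))
            for a in arguments
        }
        delta = max(
            (abs(new_scores[a] - scores[a]) for a in arguments), default=0.0
        )
        scores = new_scores
        if delta < tol:
            break
    return scores


def rank(scores):
    """Order arguments from strongest to weakest, ties broken by name."""
    return sorted(scores, key=lambda a: (-scores[a], a))


if __name__ == "__main__":
    # B attacks A, and C attacks B, so C defends A.
    scores = propagate(["A", "B", "C"], [("B", "A"), ("C", "B")])
    print(rank(scores))
```

In this toy debate, A outranks B even though A is attacked and C never addresses A directly: C weakens B, which reduces the damage B does to A. Because the ranking is a deterministic function of the edge list, two judges that agree on the local relations necessarily agree on the global ranking, which is the property the abstract emphasizes.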