Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can implicitly model and approximate quantitative argumentation debate (QuAD) semantics—used for ranking argument plausibility in natural-language debates—without explicit access to argument graph structures. Method: We introduce QuAD semantics into LLM evaluation for the first time, leveraging the dialogue-based NoDE dataset and employing chain-of-thought prompting and in-context learning to elicit structured reasoning from LLMs. Contribution/Results: Results show moderate rank correlation (ρ ≈ 0.45) between LLM outputs and QuAD scores on short texts, confirming nascent capacity for nonlinear argumentative reasoning. However, performance degrades substantially on longer texts or when discourse coherence breaks down, indicating limited implicit capture of argument topology. This work establishes a novel evaluation paradigm for LLM reasoning grounded in acceptability semantics, bridging formal argumentation theory and LLM assessment.
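The evaluation paradigm described above compares two rankings of the same arguments: one induced by the LLM's plausibility judgments, one by QuAD scores. A minimal sketch of how such a Spearman rank correlation could be computed is below; the scores are invented for illustration and the no-ties formula is used for simplicity (the paper's actual evaluation pipeline is not published here).

```python
# Hedged sketch: rank correlation between an LLM's argument ratings and
# QuAD scores. All score values below are invented example data.

def spearman_rho(xs, ys):
    """Spearman's rho for two equal-length score lists without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Five hypothetical arguments: QuAD scores vs. LLM plausibility ratings.
quad = [0.9, 0.4, 0.7, 0.2, 0.5]
llm  = [0.8, 0.5, 0.6, 0.1, 0.7]
print(round(spearman_rho(quad, llm), 2))  # → 0.9
```

A value near 1 indicates the model's ranking closely tracks the QuAD ordering; the paper reports only moderate agreement (ρ ≈ 0.45) on short texts.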

📝 Abstract
Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.
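To make the abstract's notion of acceptability scores concrete, the sketch below propagates QuAD-style scores over a tiny acyclic argument graph: each argument has a base score, attackers pull the score down, supporters push it up. This is an illustration only; the combination rule shown follows the DF-QuAD variant, and the graph, base scores, and function names are invented, not taken from the paper.

```python
# Illustrative QuAD-style score propagation (hypothetical sketch, not the
# paper's implementation). Combination rule follows the DF-QuAD variant.

def aggregate(scores):
    """Aggregate child strengths as 1 - prod(1 - v); 0.0 if no children."""
    acc = 0.0
    for v in scores:
        acc = acc + v - acc * v
    return acc

def strength(arg, base, attackers, supporters, cache=None):
    """Recursively compute the acceptability score of `arg` on an
    acyclic graph of attack and support relations."""
    if cache is None:
        cache = {}
    if arg in cache:
        return cache[arg]
    va = aggregate(strength(a, base, attackers, supporters, cache)
                   for a in attackers.get(arg, []))
    vs = aggregate(strength(s, base, attackers, supporters, cache)
                   for s in supporters.get(arg, []))
    v0 = base[arg]
    if va >= vs:
        score = v0 - v0 * (va - vs)        # attacks dominate: move toward 0
    else:
        score = v0 + (1 - v0) * (vs - va)  # supports dominate: move toward 1
    cache[arg] = score
    return score

# Tiny invented debate: argument b attacks a, argument c supports a.
base = {"a": 0.5, "b": 0.6, "c": 0.8}
attackers = {"a": ["b"]}
supporters = {"a": ["c"]}
print(round(strength("a", base, attackers, supporters), 3))  # → 0.6
```

In the paper's setup the LLM never sees this graph: it receives only the dialogue text and must produce a ranking that a score propagation like the above would induce.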
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to judge debates using argumentation theory
Testing if LLMs can rank arguments without access to argument graphs
Assessing performance degradation with longer inputs and disrupted discourse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using QuAD semantics for argument scoring
Testing LLMs with advanced prompting strategies
Evaluating graph-free argument ranking performance