Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

📅 2026-05-27
🤖 AI Summary
This study examines how well large language models (LLMs) can assess argument quality along three dimensions (logical, rhetorical, and dialectical) and how closely their judgments align with those of human experts. Twelve open-weight LLMs perform pairwise argument comparisons under zero-shot, few-shot, and chain-of-thought prompting. The comparisons are fed into a Bradley–Terry model, which infers latent strength scores and yields a ranking of arguments. Among the models tested, Llama-70B aligns most closely with expert judgments (Cohen’s κ = 0.493; Kendall, Pearson, and Spearman correlations with expert-derived Bradley–Terry scores between 0.327 and 0.477). LLM predictions are also stable across repeated runs, with fewer than 7.75% of cases yielding different labels.
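The agreement and stability figures quoted above can be illustrated with a minimal sketch. The toy labels, the function names, and the reading of "fluctuation" as "repeated runs on a pair do not all agree" are assumptions made for illustration; the paper's exact definitions are not given on this page.

```python
from collections import Counter

def majority_vote(labels):
    """Most frequent label among repeated runs for one pair (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def fluctuation_rate(runs_per_pair):
    """Fraction of pairs whose repeated runs do not all yield the same label (assumed definition)."""
    unstable = sum(1 for runs in runs_per_pair if len(set(runs)) > 1)
    return unstable / len(runs_per_pair)

def cohen_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical data: three runs per argument pair, label = preferred argument.
llm_runs = [["A", "A", "A"], ["B", "A", "B"], ["B", "B", "B"], ["A", "A", "A"]]
expert = ["A", "B", "A", "A"]

llm = [majority_vote(runs) for runs in llm_runs]
print(fluctuation_rate(llm_runs))  # 0.25
print(cohen_kappa(llm, expert))    # 0.5
```

Voting over repeated runs before computing agreement mirrors the abstract's remark that remaining variability is handled via majority voting.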
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought prompting to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic), and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our results show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching a moderate Cohen's $\kappa$ of 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of the underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
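To make the ranking step concrete: a Bradley-Terry model assigns each argument i a positive strength p_i and models the probability that i is preferred over j as p_i / (p_i + p_j). The sketch below fits those strengths with the standard minorization-maximization update; it is an illustration under stated assumptions, not the paper's implementation. The function name `fit_bradley_terry`, the toy comparisons, and the fixed iteration count are all hypothetical.

```python
def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the MM update p_i = W_i / sum_j n_ij / (p_i + p_j), where W_i is the
    number of wins of item i and n_ij the number of comparisons between i and j.
    Note: an item with zero wins is driven to strength 0 by this plain update.
    """
    wins = [0.0] * n_items
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            d = n_ij / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # Items that were never compared keep their current strength.
        p = [wins[i] / denom[i] if denom[i] > 0 else p[i] for i in range(n_items)]
        # Strengths are only defined up to scale, so normalize to mean 1.
        total = sum(p)
        p = [x * n_items / total for x in p]
    return p

# Hypothetical pairwise judgments over three arguments (winner, loser).
comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2),
               (1, 2), (1, 2), (2, 1), (2, 0)]
strengths = fit_bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking)  # [0, 1, 2]
```

Fitting one such model to the expert comparisons and another to the LLM comparisons gives two score vectors over the same arguments, which is the kind of input the reported Kendall, Pearson, and Spearman correlations would be computed on.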
Problem

Research questions and friction points this paper is trying to address.

Argument Quality Assessment
Large Language Models
Bradley-Terry Model
Expert Judgment
Pairwise Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Argument Quality Assessment
Large Language Models
Bradley-Terry Model
Pairwise Comparison
Chain-of-Thought Prompting