🤖 AI Summary
This work addresses the limited understanding of how uncertainty quantification (UQ) performs in argumentative large language models (ArgLLMs) when evaluating complex, potentially contentious claims. We propose a novel UQ evaluation paradigm grounded in computational argumentation: mainstream LLM UQ techniques, including confidence calibration, ensemble sampling, and prompt engineering, are integrated into ArgLLMs' claim verification pipeline and systematically benchmarked against one another. Crucially, we find that the simplest approach, direct prompting, outperforms considerably more sophisticated UQ methods. This challenges the assumption that greater methodological complexity yields better uncertainty estimates, and it highlights the efficacy of lightweight prompting strategies for uncertainty modeling in ArgLLMs. Our findings provide both empirical evidence and conceptual insight for developing more interpretable and trustworthy argumentative AI systems.
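To make the role of UQ in ArgLLMs concrete: ArgLLMs verify a claim by generating supporting and attacking arguments and aggregating their strengths under a gradual semantics, with UQ supplying the base confidence scores. Below is a minimal sketch of such an aggregation, assuming a DF-QuAD-style semantics as used in prior ArgLLM work; the `Argument` structure, function names, and toy scores are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    text: str
    base_score: float  # confidence from a UQ method, in [0, 1]
    supporters: list["Argument"] = field(default_factory=list)
    attackers: list["Argument"] = field(default_factory=list)

def aggregate(strengths: list[float]) -> float:
    # Probabilistic-sum aggregation: 1 - prod(1 - v_i)
    out = 0.0
    for v in strengths:
        out = out + v - out * v
    return out

def strength(arg: Argument) -> float:
    """DF-QuAD-style combination of the base score with
    aggregated attacker and supporter strengths."""
    va = aggregate([strength(a) for a in arg.attackers])
    vs = aggregate([strength(s) for s in arg.supporters])
    if va >= vs:
        return arg.base_score - arg.base_score * (va - vs)
    return arg.base_score + (1.0 - arg.base_score) * (vs - va)

# Toy claim verification; in an ArgLLM the base scores
# would come from the LLM UQ method under evaluation.
claim = Argument(
    "Claim under verification", base_score=0.5,
    supporters=[Argument("pro argument", base_score=0.8)],
    attackers=[Argument("con argument", base_score=0.4)],
)
print(f"final strength of claim: {strength(claim):.3f}")
```

In this setting, a better UQ method produces base scores that more faithfully reflect the model's actual confidence, which directly shifts the final claim strength and hence the verification outcome.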
📝 Abstract
Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important for guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods into argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation, in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, which inherently also assesses the effectiveness of those UQ methods. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially on intricate and potentially contentious statements. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.
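As an illustration of the simplest strategy the abstract refers to, the sketch below elicits a confidence score by directly prompting the model. The prompt wording, the generic `llm` callable interface, and the parsing fallback are assumptions for illustration, not the paper's exact setup.

```python
import re
from typing import Callable

def direct_prompt_confidence(llm: Callable[[str], str], statement: str) -> float:
    """Elicit a confidence score for `statement` via direct prompting.

    `llm` is any callable mapping a prompt string to a completion
    string (e.g., a wrapper around a chat API). The prompt below is
    one plausible phrasing, not the paper's.
    """
    prompt = (
        "How confident are you that the following statement is true?\n"
        f"Statement: {statement}\n"
        "Answer with a single number between 0.0 and 1.0."
    )
    reply = llm(prompt)
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.5  # fall back to maximal uncertainty if parsing fails
    return min(max(float(match.group()), 0.0), 1.0)

# Usage with a stand-in model (replace with a real LLM client):
if __name__ == "__main__":
    dummy_llm = lambda prompt: "0.8"
    print(direct_prompt_confidence(dummy_llm, "The Earth orbits the Sun."))
```

A score elicited this way would serve as an argument's base score in the aggregation sketched above, which is how the different UQ methods are compared end-to-end on claim verification.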