A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic evaluation of large language models (LLMs) on argument mining tasks—specifically argument component identification and relation classification—across standard benchmarks (Args.me, UKP). Method: This study conducts the first comprehensive assessment of leading LLMs (GPT-4o, Llama-3, DeepSeek-R1) using chain-of-thought (CoT) and other prompting strategies, complemented by rigorous error analysis and dataset diagnostics. Contribution/Results: (1) GPT-4o achieves overall best performance; DeepSeek-R1 surpasses others under reasoning-augmented prompting. (2) Existing prompting methods suffer from systemic flaws—including logical chain fragmentation and neglect of implicit premises. (3) Public datasets exhibit structural limitations: coarse-grained annotations, incomplete relation coverage, and insufficient contextual grounding. The work proposes a prompting optimization framework tailored to argument mining and principled guidelines for dataset construction, establishing an empirical foundation and actionable pathways toward trustworthy LLM deployment in computational argumentation.

📝 Abstract
Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. Many benchmarks exist for testing and verifying the quality of LLMs, but there is still a lack of research and results on how these models perform on publicly available argument classification datasets. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating Chain-of-Thought prompting. The results indicate that GPT-4o outperforms the others on the argument classification benchmarks. Among the models with reasoning capabilities, DeepSeek-R1 shows its superiority. However, despite their superiority, GPT-4o and DeepSeek-R1 still make errors; the most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompting algorithms. The work also exposes weaknesses of known prompting algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is an in-depth analysis of the available argument datasets and a demonstration of their shortcomings.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' performance in argument classification tasks
Comparing reasoning-enhanced LLMs across diverse argument datasets
Identifying common errors and limitations in LLM-based argument analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes diverse LLMs including GPT, Llama, DeepSeek
Incorporates Chain-of-Thought prompting for reasoning
Tests models on Args.me and UKP datasets
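To make the setup concrete, below is a minimal sketch (not the authors' code) of how a Chain-of-Thought classification prompt and its answer parsing might look. The label set (support / attack / neutral) is taken from the paper's description of relation classification; the prompt wording, function names, and `Label:` output convention are illustrative assumptions.

```python
# Hypothetical sketch of CoT-style argument relation classification.
# Labels follow the support/attack/neutral scheme described in the paper;
# everything else (prompt text, "Label:" convention) is an assumption.

LABELS = ("support", "attack", "neutral")

def build_cot_prompt(claim: str, premise: str) -> str:
    """Compose a zero-shot CoT prompt that asks the model to reason
    step by step before emitting one of the fixed labels."""
    return (
        "You are an argument-mining assistant.\n"
        f"Claim: {claim}\n"
        f"Premise: {premise}\n"
        "Think step by step about whether the premise supports, attacks, "
        "or is neutral toward the claim. "
        "Finish with a final line of the form 'Label: <support|attack|neutral>'."
    )

def parse_label(model_output: str) -> str:
    """Extract the final label from a free-form CoT answer; fall back to
    'neutral' if no expected label line is found."""
    for line in reversed(model_output.strip().splitlines()):
        lower = line.lower()
        if lower.startswith("label:"):
            candidate = lower.split(":", 1)[1].strip()
            if candidate in LABELS:
                return candidate
    return "neutral"

# Example: parsing a hypothetical model response.
response = "The premise gives direct evidence for the claim.\nLabel: support"
print(parse_label(response))  # → support
```

The point of the fixed-format final line is that CoT responses are free-form; without a parsing convention like this, the reasoning chain and the predicted label are hard to separate reliably.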
Marcin Pietroń
AGH
Rafał Olszowski
Faculty of Humanities, AGH University of Krakow
Jakub Gomułka
Faculty of Humanities, AGH University of Krakow
Filip Gampel
Faculty of Humanities, AGH University of Krakow
Andrzej Tomski
Institute of Mathematics, University of Silesia