A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the difficulty of fairly comparing LLM-based stance detection methods, a comparison hindered by inconsistent data splits, base models, and evaluation protocols. It systematically evaluates five methods in two categories (three prompt-based inference methods and two multi-agent debate frameworks) across 14 subtasks on four benchmark datasets, using 15 large language models from six families with parameter sizes from 7B to 72B+. On every model with complete results, the best prompt-based method outperforms the best agent-based method, while agent-based methods require 7 to 12 times more API calls per sample. Model scale affects performance more than method choice, with gains plateauing around 32B parameters. Reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size. These findings offer practical guidance for method selection in stance detection research.

📝 Abstract

Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories -- prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) -- on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.
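To make the task definition concrete, here is a minimal sketch of what a single-call, direct-prompting stance classifier could look like. It is not the paper's implementation: the prompt wording, the three-label set (favor, against, neutral), the fallback to "neutral", and the `llm` callable are all assumptions made for illustration.

```python
import re
from typing import Callable

# Assumed label set; the benchmark datasets may use different label names.
LABELS = ("favor", "against", "neutral")


def build_prompt(text: str, target: str) -> str:
    """Build a zero-shot prompt (hypothetical wording, not the paper's)."""
    return (
        f"Text: {text}\n"
        f"Target: {target}\n"
        "What is the author's stance toward the target? "
        "Answer with one word: favor, against, or neutral."
    )


def parse_stance(reply: str) -> str:
    """Return the first label word found in the reply, else 'neutral'."""
    match = re.search(r"\b(favor|against|neutral)\b", reply.lower())
    return match.group(1) if match else "neutral"


def detect_stance(text: str, target: str, llm: Callable[[str], str]) -> str:
    """Classify stance with exactly one model call per sample.

    `llm` is a hypothetical callable that maps a prompt string to the
    model's reply string; plug in whatever client you actually use.
    """
    return parse_stance(llm(build_prompt(text, target)))
```

A method of this shape issues one model call per sample, which is the baseline against which the abstract's 7 to 12 times figure for agent-based methods is stated.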

Problem

Research questions and friction points this paper is trying to address.

stance detection

large language models

prompting methods

multi-agent methods

systematic comparison

Innovation

Methods, ideas, or system contributions that make the work stand out.

stance detection

prompting methods

multi-agent debate