USDC: A Dataset of User Stance and Dogmatism in Long Conversations

📅 2024-06-24

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of resources for modeling how user stance and dogmatism evolve over long conversations. We introduce USDC, the first fine-grained annotated dataset for multi-turn Reddit dialogues (764 conversations), covering five stance categories and four dogmatism levels. Departing from prior approaches that treat posts in isolation, we propose an LLM-based annotation framework that combines zero-, one-, and few-shot annotations (with reasoning) from Mistral Large and GPT-4 through majority voting, to cope with subtle or implicit stance shifts and annotation noise. We further design prompts and instruction-tuning strategies for small language models (Llama-3, Phi-3). The USDC dataset and code are publicly released. Our fine-tuned models achieve F1 scores of 72.3% on stance classification and 68.9% on dogmatism classification, substantially outperforming supervised baselines.

📝 Abstract

Identifying users' opinions and stances in long conversation threads on various topics can be critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is important. However, gathering the manual annotations needed to train such models poses several challenges: 1) it is time-consuming and costly; 2) conversation threads can be very long, increasing the chance of noisy annotations; and 3) instances where a user changes their opinion within a conversation are difficult to interpret, because such transitions are often subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) on complex natural language processing (NLP) tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user's stance in each post of a conversation on a five-point scale; ii) User Dogmatism classification, which involves labeling a user's overall opinion in the conversation on a four-point scale. Majority voting over the zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].
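The majority-voting step described in the abstract can be illustrated with a minimal sketch. It assumes six votes per item (two LLMs, each in zero-shot, one-shot, and few-shot settings) and stance labels encoded as integers 1 to 5. The function name `majority_vote`, the integer encoding, and the tie-handling rule are all hypothetical choices made for illustration; the abstract does not specify how ties are resolved, and this is not the authors' released code.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(labels: Sequence[Hashable],
                  tie_break: Optional[Hashable] = None) -> Optional[Hashable]:
    """Return the most frequent label, or `tie_break` when the top count is shared.

    The tie rule is an assumption for this sketch, not taken from the paper.
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    counts = Counter(labels)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    # A unique winner is the aggregated label; otherwise defer to the caller.
    return winners[0] if len(winners) == 1 else tie_break


# Hypothetical votes for one user post, keyed by (model, prompting setting).
# Integer labels 1..5 stand in for the five-point stance scale.
votes = {
    ("mistral-large", "zero-shot"): 4,
    ("mistral-large", "one-shot"): 4,
    ("mistral-large", "few-shot"): 5,
    ("gpt-4", "zero-shot"): 4,
    ("gpt-4", "one-shot"): 3,
    ("gpt-4", "few-shot"): 4,
}

print(majority_vote(list(votes.values())))  # prints 4
```

Returning a caller-supplied `tie_break` value instead of picking arbitrarily keeps ambiguous items visible, so they can be routed to a separate review step if desired.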

Problem

Research questions and friction points this paper is trying to address.

Analyzing user stance changes in long Reddit conversations

Classifying user dogmatism levels in multi-user discussions

Automating annotation of opinion shifts using LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Built USDC dataset for user stance and dogmatism

Used Mistral Large and GPT-4 for automated annotations

Fine-tuned small models like LLaMA and Vicuna
