Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models

📅 2023-10-18
🏛️ BigData Congress [Services Society]
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address label dependency, poor generalizability, and weak agreement with human judgments in short-text scoring, this paper proposes Concept-Guided Chain-of-Thought (CGCoT), a framework that recasts pairwise text comparison as an interpretable pattern recognition task built on concept-specific breakdowns. CGCoT chains researcher-designed prompts so that an LLM generates a concept-specific breakdown of each text, compares the breakdowns pairwise with an LLM, and aggregates the answers into scores with a probability model; it is compatible with both open- and closed-source LLMs. Applied to tweets expressing aversion to specific political parties, CGCoT correlates more strongly with human judgments than unsupervised methods such as Wordfish. Apart from a small pilot dataset used to develop the prompts, it requires no hand-labeled data, and its predictions are on par with RoBERTa-Large fine-tuned on thousands of hand-labeled tweets.
๐Ÿ“ Abstract
Existing text scoring methods require a large corpus, struggle with short texts, or require hand-labeled data. We develop a text scoring framework that leverages generative large language models (LLMs) to (1) set texts against the backdrop of information from the near-totality of the web and digitized media, and (2) effectively transform pairwise text comparisons from a reasoning problem to a pattern recognition task. Our approach, concept-guided chain-of-thought (CGCoT), utilizes a chain of researcher-designed prompts with an LLM to generate a concept-specific breakdown for each text, akin to guidance provided to human coders. We then pairwise compare breakdowns using an LLM and aggregate answers into a score using a probability model. We apply this approach to better understand speech reflecting aversion to specific political parties on Twitter, a topic that has commanded increasing interest because of its potential contributions to democratic backsliding. We achieve stronger correlations with human judgments than widely used unsupervised text scoring methods like Wordfish. In a supervised setting, besides a small pilot dataset to develop CGCoT prompts, our measures require no additional hand-labeled data and produce predictions on par with RoBERTa-Large fine-tuned on thousands of hand-labeled tweets. This project showcases the potential of combining human expertise and LLMs for scoring tasks.
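The aggregation step described in the abstract, turning pairwise answers into one score per text, can be sketched as follows. The abstract says only "a probability model"; this sketch assumes a Bradley-Terry model, a common choice for pairwise data. The function name `fit_bradley_terry`, the `pseudo` regularization, and the hard-coded `judgments` list (standing in for LLM answers) are illustrative assumptions, not the authors' implementation.

```python
import math


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10, pseudo=0.1):
    """Estimate one score per item from (winner, loser) pairs.

    Assumption: a Bradley-Terry model fitted with the standard
    minorization-maximization update. `pseudo` adds a fractional virtual
    win and loss against a dummy item of strength 1, which keeps the
    estimates finite for items that never win or never lose.
    """
    items = sorted({t for pair in comparisons for t in pair})
    wins = {t: 0.0 for t in items}
    counts = {}  # counts[(i, j)] = number of comparisons between i and j
    for winner, loser in comparisons:
        wins[winner] += 1.0
        counts[(winner, loser)] = counts.get((winner, loser), 0.0) + 1.0
        counts[(loser, winner)] = counts.get((loser, winner), 0.0) + 1.0

    strength = {t: 1.0 for t in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Virtual comparisons against the dummy item of strength 1.
            denom = 2.0 * pseudo / (strength[i] + 1.0)
            for j in items:
                n_ij = counts.get((i, j), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = (wins[i] + pseudo) / denom
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break

    # Report log-strengths centered at zero so scores are comparable.
    logs = {i: math.log(s) for i, s in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {i: v - mean for i, v in logs.items()}


# Hypothetical judgments: each pair is (text judged to express more of
# the target concept, text judged to express less of it).
judgments = [("t1", "t2"), ("t1", "t3"), ("t2", "t3")]
scores = fit_bradley_terry(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a full pipeline, each pair in `judgments` would come from asking an LLM which of two concept-specific breakdowns expresses more of the target concept; the resulting scores could then be correlated with human judgments.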
Problem

Research questions and friction points this paper is trying to address.

Short Text Scoring
Data Dependency
Inter-Rater Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-Guided Chain of Thought
Large Language Models
Comparative Text Scoring