Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cognitive distortion detection suffers from low inter-annotator agreement and subjective labeling, even among expert human annotators, resulting in unreliable datasets and inconsistent model evaluation. To address this, we use GPT-4 as a consistent annotator and show that multiple independent LLM runs reveal stable labeling patterns (Fleiss' kappa = 0.78 across runs). We further introduce a dataset-agnostic evaluation framework that uses Cohen's kappa as an effect size measure, enabling fair cross-dataset and cross-study comparisons where metrics such as F1 fall short. Experiments show that models trained on LLM-annotated data outperform those trained on human-labeled baselines, demonstrating that LLMs offer a scalable and internally consistent alternative for generating training data in subjective NLP tasks.
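
The 0.78 figure is Fleiss' kappa computed over the independent GPT-4 runs treated as raters. Below is a minimal sketch of that computation, assuming the per-run labels have already been collected; the label set and toy data are hypothetical, not the paper's.

```python
# Sketch: inter-run agreement for independent LLM annotation runs,
# computed with Fleiss' kappa (runs play the role of raters).
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = texts, columns = independent GPT-4 runs (hypothetical toy labels).
run_labels = [
    ["catastrophizing", "catastrophizing", "catastrophizing"],
    ["mind_reading",    "mind_reading",    "overgeneralization"],
    ["no_distortion",   "no_distortion",   "no_distortion"],
    ["labeling",        "labeling",        "labeling"],
]

# aggregate_raters turns the raw (subject x rater) labels into the
# (subject x category) count table that fleiss_kappa expects.
table, _categories = aggregate_raters(run_labels)
print(f"Fleiss' kappa across runs: {fleiss_kappa(table, method='fleiss'):.2f}")
```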

📝 Abstract
Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss' kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
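
The dataset-agnostic comparison rests on Cohen's kappa being chance-corrected: unlike F1, it discounts the agreement expected from each dataset's label distribution, so scores remain comparable across datasets with different class balances. A minimal sketch of the idea, with hypothetical dataset names and labels:

```python
# Sketch: Cohen's kappa as a chance-corrected effect size, reported
# alongside F1 to show why kappa travels better across datasets.
from sklearn.metrics import cohen_kappa_score, f1_score

datasets = {
    "dataset_a": (  # (reference labels, model predictions), hypothetical
        ["cd", "cd", "none", "cd", "none", "cd"],
        ["cd", "none", "none", "cd", "none", "cd"],
    ),
    "dataset_b": (  # much rarer positive class than dataset_a
        ["cd", "none", "none", "none", "none", "none"],
        ["cd", "none", "cd", "none", "none", "none"],
    ),
}

for name, (gold, pred) in datasets.items():
    kappa = cohen_kappa_score(gold, pred)
    f1 = f1_score(gold, pred, pos_label="cd")
    # F1 shifts with class balance; kappa subtracts the agreement expected
    # by chance under each dataset's own label distribution.
    print(f"{name}: kappa={kappa:.2f}, F1={f1:.2f}")
```
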
Problem

Research questions and friction points this paper is trying to address.

Improving annotation consistency in cognitive distortion detection tasks
Developing dataset-agnostic evaluation for fair model comparisons
Using LLMs as scalable alternatives to human annotators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs as consistent annotators for cognitive distortion detection
Multiple independent LLM runs reveal stable labeling patterns (a vote-aggregation sketch follows this list)
Introducing dataset-agnostic evaluation framework with Cohen's kappa
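
A minimal sketch of how stable patterns from repeated runs could be distilled into one training label per text, assuming simple majority voting; the aggregation rule is an illustrative assumption, not necessarily the paper's exact procedure.

```python
# Sketch: majority-vote aggregation of labels from independent LLM runs
# (an assumed aggregation rule, shown for illustration).
from collections import Counter

def majority_label(labels: list[str]) -> str:
    """Return the most frequent label assigned across runs."""
    (label, _count), = Counter(labels).most_common(1)
    return label

runs = ["catastrophizing", "catastrophizing", "overgeneralization"]
print(majority_label(runs))  # -> catastrophizing
```
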
Neha Sharma
University of Tartu, Estonia
Navneet Agarwal
University of Tartu, Estonia
Kairit Sirts
University of Tartu, Estonia
Natural Language Processing · Computational Linguistics · Computational Psychology · #unitartucs