Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the problem of explaining human label variation (HLV), i.e., plausible differences in the labels that multiple annotators assign to the same data instance. Prior explanation methods typically work in reverse, producing a rationale for an answer that is already given. The paper instead proposes an explanation paradigm built on the forward reasoning in large language model (LLM) chain-of-thought (CoT) outputs: (1) CoTs are parsed to extract supporting and opposing statements for each answer option; (2) linguistically grounded discourse segmenters enable finer-grained and more accurate extraction; and (3) a rank-based HLV evaluation framework prioritizes the ranking of answers over exact scores. Evaluated on three datasets, the method outperforms a direct generation method and baselines, and shows better alignment with human answer rankings. The results suggest that CoTs implicitly embed rationales for each answer option and that these rationales can be recovered.
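To make the extraction step concrete, the following is a toy sketch only. The paper uses an LLM together with discourse segmenters; here a regex sentence splitter and hand-picked cue words stand in for both, and the names `segment`, `extract_statements`, `SUPPORT_CUES`, and `OPPOSE_CUES`, as well as the example CoT, are hypothetical and not taken from the paper.

```python
import re

# Hypothetical cue words; the paper relies on an LLM, not keyword matching.
SUPPORT_CUES = ("suggests", "supports", "implies", "consistent with")
OPPOSE_CUES = ("however", "contradicts", "rules out", "unlikely")


def segment(cot: str) -> list[str]:
    """Naive sentence splitter standing in for a discourse segmenter."""
    parts = re.split(r"(?<=[.!?])\s+", cot.strip())
    return [p for p in parts if p]


def extract_statements(cot: str, options: list[str]) -> dict[str, dict[str, list[str]]]:
    """Collect, per answer option, the units that seem to support or oppose it."""
    result = {opt: {"support": [], "oppose": []} for opt in options}
    for unit in segment(cot):
        lowered = unit.lower()
        for opt in options:
            if opt.lower() not in lowered:
                continue  # this unit does not mention the option
            # Opposing cues take priority so "However, ..." is not read as support.
            if any(cue in lowered for cue in OPPOSE_CUES):
                result[opt]["oppose"].append(unit)
            elif any(cue in lowered for cue in SUPPORT_CUES):
                result[opt]["support"].append(unit)
    return result


# Made-up CoT for a three-way inference task.
cot = (
    "The premise mentions a dog running, which supports entailment. "
    "However, the park is never named, so this rules out entailment for strict readers. "
    "That gap suggests neutral is also plausible."
)
stmts = extract_statements(cot, ["entailment", "neutral", "contradiction"])
for option, found in stmts.items():
    print(option, len(found["support"]), len(found["oppose"]))
```

The point of the sketch is the output shape: each answer option ends up with its own lists of supporting and opposing statements, which is what allows one CoT to explain why different annotators might choose different labels.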

📝 Abstract

The recent rise of reasoning-tuned Large Language Models (LLMs), which generate chains of thought (CoTs) before giving the final answer, has attracted significant attention and offers new opportunities for gaining insights into human label variation (HLV), which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but it typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option before the answers are generated. We thus propose a novel LLM-based pipeline enriched with linguistically grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, since exact scores favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
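The difference between comparing rankings and comparing exact scores can be illustrated with a small sketch. The metrics here are assumptions chosen for illustration: Kendall's tau stands in for a rank-based measure and total variation distance for a distribution-based one; this page does not state which metrics the paper actually uses, and the label distributions below are invented.

```python
from itertools import combinations


def ranking(scores: dict[str, float]) -> list[str]:
    """Labels ordered from highest to lowest score (ties broken alphabetically)."""
    return sorted(scores, key=lambda k: (-scores[k], k))


def kendall_tau(a: dict[str, float], b: dict[str, float]) -> float:
    """Kendall's tau-a over label pairs: +1 same order, -1 reversed order."""
    labels = sorted(a)
    concordant = discordant = 0
    for x, y in combinations(labels, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(labels) * (len(labels) - 1) // 2
    return (concordant - discordant) / pairs


def total_variation(a: dict[str, float], b: dict[str, float]) -> float:
    """Half the L1 distance between two label distributions."""
    return 0.5 * sum(abs(a[k] - b[k]) for k in a)


# Invented distributions: one human reference and two hypothetical systems.
human = {"entailment": 0.60, "neutral": 0.30, "contradiction": 0.10}
model_a = {"entailment": 0.40, "neutral": 0.35, "contradiction": 0.25}
model_b = {"entailment": 0.42, "neutral": 0.48, "contradiction": 0.10}

for name, model in (("model_a", model_a), ("model_b", model_b)):
    print(name, ranking(model), kendall_tau(human, model), total_variation(human, model))
```

In this made-up case `model_b` is closer to the human distribution in total variation, yet it swaps the top two answers, while `model_a` is further away in exact scores but preserves the human ranking. A rank-based evaluation would prefer `model_a`; a distribution-distance evaluation would prefer `model_b`.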

Problem

Research questions and friction points this paper is trying to address.

Explaining human label variation using LLM-generated chains of thought

Extracting supporting and opposing statements from CoTs with discourse segmenters

Improving alignment of answer rankings with human label distributions

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based pipeline with discourse segmenters

Extracts supporting and opposing statements from CoTs

Rank-based HLV evaluation framework

🔎 Similar Papers

Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ?