🤖 AI Summary
This study addresses the problem of explaining human label variation (HLV), i.e., plausibly divergent labels that multiple annotators assign to the same data instance. Existing explanation methods typically work in reverse, generating explanations for answers that are already given. We propose an explanation paradigm grounded in the forward reasoning of large language model (LLM) chains of thought (CoTs): (1) CoTs are parsed to automatically extract supporting and opposing statements for each candidate answer; (2) linguistically grounded discourse segmentation makes this extraction more accurate; and (3) a rank-based HLV evaluation framework prioritizes the ranking of answers over exact scores. Evaluated on three datasets, our method outperforms direct generation and other baselines, achieving better consistency between predicted answer rankings and empirical human label distributions. The results suggest that CoTs embed rationales for each answer option and that these rationales can be recovered effectively.
📝 Abstract
The recent rise of reasoning-tuned Large Language Models (LLMs), which generate chains of thought (CoTs) before giving the final answer, has attracted significant attention and offers new opportunities for gaining insights into human label variation (HLV), which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but it typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option before the answers are generated. We thus propose a novel LLM-based pipeline enriched with linguistically grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over the exact scores emphasized by direct comparisons of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and its answer rankings align better with those of human annotators, highlighting the effectiveness of our approach.
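To make the contrast between rank-based and score-based evaluation concrete, the following is a minimal sketch and not the paper's actual metric. It assumes Kendall's tau as a stand-in rank measure and total variation distance as a stand-in distribution measure; the three-option label distributions are made-up illustrative values, and the function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall tau-a between two equally long score lists (needs >= 2 items).

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


# Illustrative (made-up) distributions over three answer options.
human = [0.6, 0.3, 0.1]      # share of annotators choosing each option
model_a = [0.4, 0.35, 0.25]  # same ordering as humans, different scores
model_b = [0.3, 0.6, 0.1]    # top two options swapped

tau_a = kendall_tau(human, model_a)
tau_b = kendall_tau(human, model_b)
tv_a = total_variation(human, model_a)
tv_b = total_variation(human, model_b)

print(f"model_a: tau={tau_a:.2f}, TV={tv_a:.2f}")
print(f"model_b: tau={tau_b:.2f}, TV={tv_b:.2f}")
```

In this toy case `model_a` reproduces the human ordering exactly even though its scores differ from the human distribution, which is the kind of agreement a rank-based evaluation rewards and a score-based comparison penalizes.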