Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the efficacy of chain-of-thought (CoT) reasoning in probabilistic, ambiguous tasks that model human label variability, moving beyond conventional deterministic single-answer settings. The authors propose Cross-CoT, a decoupling framework that integrates distribution alignment metrics, variance contribution analysis, and reasoning trace tracking to disentangle, for the first time, the distinct influences of CoT-generated rationales and model priors on output distributions from the perspective of human annotation variation. Their findings reveal that CoT overwhelmingly governs final answer selection—accounting for 99% of accuracy variance—whereas model priors predominantly shape the output distribution’s structure, influencing over 80% of its ranking properties. Moreover, while CoT monotonically improves accuracy during inference, it exerts limited influence on distributional form, thereby challenging the prevailing assumption that CoT enables fine-grained calibration of output distributions.
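The "distribution alignment" idea above can be made concrete: compare the model's distribution over answer options against the empirical distribution of human annotator votes with a symmetric divergence. A minimal sketch using Jensen-Shannon divergence (the paper's exact alignment metrics are not specified here; the option counts and distributions below are made-up illustrations):

```python
import math

def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions.

    Returns 0.0 for identical distributions, up to 1.0 for disjoint ones.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2

# Hypothetical human label distribution over 4 answer options (annotator votes),
# and a model's option distribution after conditioning on a CoT trace.
human = [0.50, 0.30, 0.15, 0.05]
model = [0.70, 0.10, 0.10, 0.10]

alignment_gap = js_divergence(human, model)  # smaller = better calibrated to humans
print(alignment_gap)
```

A sharply peaked model distribution over the correct top option can still score poorly on this metric, which is exactly the accuracy-versus-distribution tension the paper studies.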

📝 Abstract
Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation--which requires capturing probabilistic ambiguity rather than resolving it--remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT serves as a decisive LLM decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.
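The Cross-CoT setup pairs CoT text from one model with the priors of another, then asks which factor explains the variance in outcomes. A toy sketch of that variance-contribution logic on a 2x2 grid (the accuracy numbers and the two-factor sum-of-squares decomposition are illustrative assumptions, not the paper's actual data or analysis):

```python
# Cross-CoT grid: rows = which model wrote the CoT, cols = which model answers.
# accuracy[i][j] = accuracy when model j answers conditioned on model i's CoT.
accuracy = [
    [0.82, 0.81],  # CoT generated by model A
    [0.64, 0.65],  # CoT generated by model B
]

grand = sum(sum(row) for row in accuracy) / 4
row_means = [sum(row) / 2 for row in accuracy]                      # per CoT source
col_means = [sum(accuracy[i][j] for i in range(2)) / 2 for j in range(2)]  # per prior

# Sum-of-squares attributable to each factor (one-way main effects).
ss_cot = 2 * sum((m - grand) ** 2 for m in row_means)
ss_prior = 2 * sum((m - grand) ** 2 for m in col_means)
ss_total = sum((accuracy[i][j] - grand) ** 2 for i in range(2) for j in range(2))

cot_share = ss_cot / ss_total  # fraction of accuracy variance driven by the CoT text
print(cot_share)
```

In this toy grid, swapping the answering model barely moves accuracy while swapping the CoT source moves it a lot, so nearly all the variance is attributed to the CoT text, mirroring the paper's headline 99% finding in spirit.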
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Human Label Variation
Distributional Ambiguity
LLM Reasoning
Model Priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Human Label Variation
Model Priors
Distributional Calibration
Decoupling Mechanism