🤖 AI Summary
This study investigates the efficacy of chain-of-thought (CoT) reasoning in probabilistic, ambiguous tasks that model human label variability, moving beyond conventional deterministic single-answer settings. The authors propose Cross-CoT, a decoupling framework that integrates distribution alignment metrics, variance contribution analysis, and reasoning trace tracking to disentangle, for the first time, the distinct influences of CoT-generated rationales and model priors on output distributions from the perspective of human annotation variation. Their findings reveal that CoT content overwhelmingly governs final answer selection, accounting for 99% of accuracy variance, whereas model priors predominantly shape the structure of the output distribution, accounting for over 80% of the variance in distributional ranking. Moreover, while CoT's influence on accuracy grows monotonically over the course of the reasoning process, it exerts limited influence on distributional form, thereby challenging the prevailing assumption that CoT enables fine-grained calibration of output distributions.
📝 Abstract
Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT serves as a decisive LLM decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.
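The abstract does not spell out how the variance contributions are computed. As a rough illustration only, one plausible reading is a two-way sum-of-squares decomposition over a Cross-CoT grid, where each cell holds a metric obtained by feeding the CoT written by one model to another model that produces the final answer. The function name `variance_shares`, the grid layout, and the toy numbers below are assumptions made for this sketch, not the authors' implementation or data.

```python
from statistics import fmean


def variance_shares(grid):
    """Split the variance of a Cross-CoT grid into CoT, prior, and interaction shares.

    Assumed layout (hypothetical): grid[i][j] is a metric (e.g. accuracy)
    measured when the CoT from source model i is given to answering model j.
    Rows therefore vary the CoT content; columns vary the model prior.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    grand = fmean(v for row in grid for v in row)
    row_means = [fmean(row) for row in grid]
    col_means = [fmean(grid[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Standard two-way sums of squares without replication.
    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    ss_cot = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_prior = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_interaction = ss_total - ss_cot - ss_prior

    if ss_total == 0:
        raise ValueError("grid has no variance to decompose")
    return {
        "cot": ss_cot / ss_total,
        "prior": ss_prior / ss_total,
        "interaction": ss_interaction / ss_total,
    }


# Toy grid (made-up numbers): the metric changes only with the CoT source,
# so the CoT share should dominate.
toy_accuracy = [
    [0.2, 0.2],
    [0.8, 0.8],
]
shares = variance_shares(toy_accuracy)
```

Under this reading, a result such as "99% variance contribution" for accuracy would correspond to a `cot` share near 1, while a ranking metric governed by model priors would show a large `prior` share instead.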