VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

πŸ“… 2025-10-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing chain-of-thought (CoT) supervised fine-tuning (SFT) relies on a uniform cross-entropy loss across all tokens, leading to suboptimal allocation of the supervision signal and poor generalization. To address this, we propose VCORE, a variance-controlled gradient reweighting framework grounded in constrained optimization theory. VCORE dynamically modulates the supervision strength of each token along a reasoning trajectory, amplifying gradient contributions from critical reasoning steps. Crucially, VCORE requires no additional human annotations or reinforcement learning, and integrates seamlessly into standard SFT pipelines. Empirical evaluation on mathematical and code reasoning tasks demonstrates consistent and significant improvements over state-of-the-art methods under both in-domain and cross-domain settings. Moreover, models initialized with VCORE yield superior performance when subsequently fine-tuned via reinforcement learning. We validate VCORE's effectiveness on the Qwen3 series and LLaMA-3.1-8B-Instruct.
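The summary above describes per-token reweighting of the SFT cross-entropy loss, with weights driven by a variance-aware rule. The paper's exact objective is not reproduced on this page, so the following is only a minimal illustrative sketch: the weighting rule (deviation from the trajectory mean, normalized by the standard deviation, with weights rescaled to mean 1) and the function names are assumptions, not VCORE's actual formula.

```python
# Illustrative sketch of variance-controlled token reweighting for CoT SFT.
# NOTE: the weighting rule below is an assumption for illustration; it is
# not the VCORE objective from the paper.

def vcore_style_weights(token_losses, eps=1e-8):
    """Map per-token cross-entropy losses to supervision weights.

    Tokens whose loss deviates most from the trajectory mean (a rough
    proxy for 'critical' reasoning steps) get amplified weight; weights
    are rescaled to mean 1 so the overall loss scale is preserved.
    """
    n = len(token_losses)
    mean = sum(token_losses) / n
    var = sum((l - mean) ** 2 for l in token_losses) / n
    std = var ** 0.5
    # Amplify each token in proportion to its normalized deviation.
    raw = [1.0 + abs(l - mean) / (std + eps) for l in token_losses]
    scale = n / sum(raw)
    return [w * scale for w in raw]

def reweighted_sft_loss(token_losses):
    """Weighted cross-entropy averaged over one reasoning trajectory."""
    weights = vcore_style_weights(token_losses)
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)
```

In a real pipeline the per-token losses would come from the model's logits (e.g. a cross-entropy with no reduction), and the weights would be treated as constants so they reshape, rather than cancel, the gradient.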

πŸ“ Abstract
Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce **V**ariance-**C**ontrolled **O**ptimization-based **RE**weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.
Problem

Research questions and friction points this paper is trying to address.

Optimizes token-level supervision in chain-of-thought training
Addresses misallocated supervision in long reasoning trajectories
Improves generalization for mathematical and coding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

VCORE optimizes token supervision via constrained optimization
Adaptively allocates supervision across reasoning trajectory tokens
Enhances generalization in mathematical and coding benchmarks
πŸ”Ž Similar Papers
No similar papers found.
Xuan Gong
Shanghai Jiao Tong University
Senmiao Wang
Chinese University of Hong Kong (Shenzhen)
Hanbo Huang
Shanghai Jiao Tong University
Ruoyu Sun
Chinese University of Hong Kong (Shenzhen)
Shiyu Liang
University of Illinois at Urbana-Champaign
Machine Learning · Optimization · Applied Probability