Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

2026-05-01
Citations: 0
Influential: 0

AI Summary
This work addresses the challenge of integrating supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in large language model post-training, where sequential training can cause catastrophic forgetting and joint optimization can suffer from gradient conflicts. Viewed as task vectors, SFT and RLVR updates differ in magnitude, sign, and module-wise distribution. To overcome this, the authors propose DoTS, a decoupled test-time synthesis framework that fuses the two capabilities through task vector arithmetic at inference time, without any retraining. By combining selective sparsification, norm-preserving rescaling, and a Bayesian-optimization search for Pareto-optimal combination coefficients, DoTS matches or surpasses training-based SFT-RLVR integration on multiple mathematical reasoning benchmarks at only ~3% of the computational cost. Applied to stronger post-trained checkpoints, it surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning.
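To make the summary concrete, here is a minimal sketch of task vector arithmetic with sparsification and norm-preserving rescaling. It is an illustration under stated assumptions, not the authors' implementation: weights are flat lists of floats, sparsification keeps the largest-magnitude entries of each task vector, and the names `sparsify_keep_norm`, `synthesize`, `alpha`, `beta`, and `keep_ratio` are hypothetical. The summary does not specify DoTS's actual selection rule or which vectors are sparsified.

```python
import math

def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]

def l2_norm(vec):
    """Euclidean norm of a flat weight vector."""
    return math.sqrt(sum(v * v for v in vec))

def sparsify_keep_norm(vec, keep_ratio):
    """Zero all but the largest-magnitude entries, then rescale the
    survivors so the L2 norm equals that of the original vector.
    ASSUMPTION: magnitude-based top-k selection, chosen for illustration."""
    k = max(1, round(keep_ratio * len(vec)))
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    kept = set(order[:k])
    sparse = [v if i in kept else 0.0 for i, v in enumerate(vec)]
    sparse_norm = l2_norm(sparse)
    if sparse_norm == 0.0:
        return sparse  # all-zero task vector: nothing to rescale
    scale = l2_norm(vec) / sparse_norm
    return [v * scale for v in sparse]

def synthesize(base, sft, rlvr, alpha, beta, keep_ratio=0.5):
    """Merged weights: base + alpha * tv_sft + beta * tv_rlvr.
    ASSUMPTION: both task vectors are sparsified with the same ratio."""
    tv_sft = sparsify_keep_norm(task_vector(sft, base), keep_ratio)
    tv_rlvr = sparsify_keep_norm(task_vector(rlvr, base), keep_ratio)
    return [b + alpha * s + beta * r for b, s, r in zip(base, tv_sft, tv_rlvr)]
```

Because the two checkpoints are only combined through `synthesize`, they can be trained independently, and changing `alpha` or `beta` requires no gradient updates.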
Abstract
SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling along a different dimension: SFT expands knowledge breadth, while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30× magnitude disparity, 45% sign interference, and heterogeneous module-wise update distributions. These findings show that SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT-RLVR integration methods across multiple mathematical reasoning benchmarks while incurring only ~3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. Code is available at https://github.com/chaohaoyuan/DoTS.
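The coefficient search described in the abstract ends with a Pareto-frontier selection over two objectives. The sketch below shows only that final filtering step, under assumptions: each candidate is a hypothetical tuple `(alpha, beta, consistency, perplexity)`, higher consistency and lower perplexity are taken to be better, and the scores are placeholders rather than results from the paper. How DoTS measures consistency, and how its Bayesian optimizer proposes candidates, is not specified here.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (alpha, beta, consistency, perplexity).
    ASSUMPTION: higher consistency and lower perplexity are preferred.
    A candidate is dominated if another one is at least as good on
    both objectives and strictly better on at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[2] >= c[2] and o[3] <= c[3] and (o[2] > c[2] or o[3] < c[3])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Placeholder scores for illustration only; NOT results from the paper.
toy_candidates = [
    (1.0, 0.2, 0.60, 5.0),
    (0.8, 0.5, 0.70, 5.5),
    (0.5, 1.0, 0.65, 6.0),  # dominated by the candidate above
    (0.3, 1.2, 0.75, 7.0),
]

for alpha, beta, cons, ppl in pareto_front(toy_candidates):
    print(f"alpha={alpha} beta={beta} consistency={cons} perplexity={ppl}")
```

In a full pipeline, each candidate's scores would come from evaluating the merged model on the small unlabeled query set, with an optimizer proposing the next `(alpha, beta)` pair instead of a fixed list.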
Problem

Research questions and friction points this paper is trying to address.

SFT
RLVR
task vectors
integration
catastrophic forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

task vector
test-time synthesis
decoupled integration
Bayesian optimization
sparsification