How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cold-start stalling that occurs when reasoning models are post-trained with only output-level supervision: when the initial success rate is low, reinforcement learning receives almost no learning signal. We introduce Tsallis entropy into reasoning model training for the first time, proposing a family of losses $ \mathcal{J}_q $ based on the Tsallis q-logarithm that continuously interpolates between reinforcement learning and likelihood estimation over latent trajectories. All members of the family share the same per-example gradient direction and differ only in a tunable scalar magnitude, which is what accelerates escape from the cold-start regime. We characterize the trade-off, governed by the parameter q, between escape speed and susceptibility to memorizing noise, and derive two Monte Carlo gradient estimators: Gradient-Amplified Reinforcement Learning (GARL) and Posterior-Attenuated Fine-Tuning (PAFT). On FinQA, HotPotQA, and MuSiQue, these estimators substantially alleviate cold-start stalling, and PAFT reaches 47.9 maj@16 on HotPotQA, an improvement of 14.4 over GRPO.
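
For reference, the interpolation described above can be written out using the standard definition of the Tsallis q-logarithm. This is a sketch under assumptions, not the paper's exact formulation: it takes $P_θ$ to be the marginal probability of a correct output and writes the family as $\log_q P_θ$, and the sign and normalization used in the paper may differ.

```latex
% Standard Tsallis q-logarithm; recovers the natural log as q -> 1.
\log_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1),
\qquad \lim_{q \to 1} \log_q(x) = \log x .

% Assumed form of the family: q = 0 gives P_\theta - 1 (expected reward up to
% a constant), q = 1 gives \log P_\theta (log-marginal-likelihood).
\mathcal{J}_q(\theta) = \log_q P_\theta ,
\qquad
\nabla_\theta \mathcal{J}_q
  = P_\theta^{-q} \, \nabla_\theta P_\theta
  = P_\theta^{1-q} \, \nabla_\theta \log P_\theta .
```

Under this assumed form, every $q$ yields the same direction $\nabla_θ P_θ$, and only the scalar factor $P_θ^{-q}$ changes.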

📝 Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $\mathcal{J}_q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_θ$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_θ^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
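
The escape-time gap between the two poles can be illustrated with a toy simulation. The sketch below is not the paper's setup: it assumes a single scalar parameter with a sigmoid success probability and an objective of the form $\log_q P_θ$, so that the gradient flow is $\dot{θ} = P^{1-q}(1-P)$. The names `q_log` and `escape_time`, the target of 0.5, and the Euler step size are all hypothetical choices made for illustration.

```python
import math


def q_log(x, q):
    """Standard Tsallis q-logarithm; falls back to the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def escape_time(p0, q, target=0.5, dt=0.01, max_steps=10_000_000):
    """Time for a toy model to raise its success probability from p0 to target.

    Toy assumption (not from the paper): P = sigmoid(theta), objective
    log_q(P), so d(theta)/dt = P**(-q) * dP/dtheta = P**(1 - q) * (1 - P).
    Integrated with forward Euler steps of size dt.
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    for _ in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        theta += dt * p ** (1.0 - q) * (1.0 - p)
        t += dt
    raise RuntimeError("did not reach target within max_steps")


if __name__ == "__main__":
    for q in (0.0, 0.5, 0.75, 1.0):
        print(f"q={q:.2f}  escape time from p0=1e-3: {escape_time(1e-3, q):.1f}")
```

In this toy model the escape time at $q{=}0$ grows roughly like $1/p_0$, while at $q{=}1$ it grows roughly like $\log(1/p_0)$, which matches the scaling stated in the abstract. It says nothing about the noise-memorization side of the trade-off.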

Problem

Research questions and friction points this paper is trying to address.

cold-start stalling

reinforcement learning from verifiable rewards

reasoning models

output-level supervision

initial success probability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tsallis loss

cold-start mitigation

gradient amplification