🤖 AI Summary
To address the high computational cost of reinforcement learning with verifiable rewards (RLVR) on long-chain reasoning tasks, driven primarily by the long context windows required during training, this paper introduces Thinking-Free Policy Initialization (TFPI), a simple adaptation that bridges long Chain-of-Thought distillation and standard RLVR. TFPI applies a *ThinkFree* operation that explicitly discards the model's thinking content by directly appending a `</think>` token, cutting token usage during inference. This lightweight, architecture-agnostic approach requires no specialized rewards or complex training designs, accelerates RL convergence, raises the performance ceiling, and yields more token-efficient reasoning even in the original slow-thinking mode. Empirically, a 4B-parameter model trained with TFPI alone reaches 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench in under 4K H20 GPU-hours, a substantial reduction in training compute for long-context RLVR.
📝 Abstract
Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct `</think>` append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
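The *ThinkFree* operation described in the abstract can be sketched in a few lines. The sketch below assumes a standard `<think>…</think>` reasoning-token convention; the exact token names, the prompt format, and the helper function are illustrative assumptions, since the abstract only specifies a direct `</think>` append.

```python
def think_free(prompt: str,
               think_open: str = "<think>",
               think_close: str = "</think>") -> str:
    """Adapt a prompt so the model skips its slow-thinking phase.

    Appends an empty thinking block closed by `</think>`, so generation
    continues directly with the final answer instead of producing a long
    chain of thought. Token strings are assumptions, not from the paper.
    """
    return f"{prompt}{think_open}\n\n{think_close}\n\n"


# Hypothetical usage: the adapted prompt is fed to the policy as-is.
adapted = think_free("Solve: what is 17 * 24?")
```

Because the adaptation happens purely at the input level, it needs no reward changes or architectural modifications, which is consistent with the abstract's claim that TFPI avoids specialized rewards or complex training designs.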