Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high computational cost of reinforcement learning with verifiable rewards (RLVR) on long-chain reasoning tasks, a cost driven largely by lengthy context windows, this paper introduces Thinking-Free Policy Initialization (TFPI), a policy initialization stage that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. Its core *ThinkFree* operation discards the intermediate thinking content outright by appending a `</think>` token directly after the input, so the model produces an answer without generating a reasoning trace. This lightweight, architecture-agnostic adaptation lowers token usage during inference, accelerates RL convergence, and raises the performance ceiling, even when the model is later run in its original slow-thinking mode. Empirically, a 4B-parameter model trained with TFPI alone reaches 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using fewer than 4K H20 GPU-hours, a substantial reduction in training compute.
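As a concrete illustration, here is a minimal sketch of the *ThinkFree* operation reconstructed from the abstract; the chat layout and special tokens are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the ThinkFree operation, reconstructed from the abstract.
# The chat layout and special tokens below are assumptions, not the paper's code.

def think_free(user_prompt: str) -> str:
    """Build an input that makes a distilled reasoning model skip thinking.

    R1-style distilled models open a <think> block before answering.
    Appending </think> right after the input closes an (empty) thinking
    block, so the model proceeds straight to the final answer and the
    reasoning tokens are never generated.
    """
    # Hypothetical chat layout; real templates vary by model family.
    return f"<|user|>{user_prompt}<|assistant|><think>\n\n</think>\n"

print(think_free("What is 17 * 24?"))
```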

📝 Abstract
Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct *</think>* append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
Problem

Research questions and friction points this paper is trying to address.

High computational cost of RLVR training under long context lengths
Irreversible performance degradation when multi-stage training starts from overly short contexts
Token-inefficient reasoning that prior work mitigates only with specialized rewards or complex training designs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thinking-Free Policy Initialization (TFPI) bridges long-CoT distillation and standard RLVR
ThinkFree operation discards thinking content via a direct `</think>` append, cutting inference token usage
TFPI accelerates RL convergence and yields more token-efficient reasoning models (see the sketch below)
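To see the token-efficiency effect in practice, a hedged comparison sketch using Hugging Face `transformers` is shown below; the distilled checkpoint is an assumed stand-in, not necessarily a model used in the paper.

```python
# Sketch: compare generated-token counts with and without the ThinkFree append.
# The checkpoint is an assumed R1-style distilled model, chosen for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

chat = [{"role": "user", "content": "What is 17 * 24?"}]
base = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# "slow-thinking": the model reasons in a <think> block before answering.
# "ThinkFree": a direct </think> append closes the block, skipping thinking.
for name, prompt in [("slow-thinking", base), ("ThinkFree", base + "</think>\n")]:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=2048)
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{name}: {n_new} generated tokens")
```

The only difference between the two runs is the direct `</think>` append, which is exactly the ThinkFree adaptation that TFPI trains on.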
👥 Authors
Xin Xu
LLM Department, Tencent
Cliveb AI
LLM Department, Tencent
Kai Yang
LLM Department, Tencent
Tianhao Chen
PhD student, Zhejiang University
Geotechnical Engineering
Yang Wang
The University of Hong Kong
Saiyong Yang
LLM Department, Tencent
Can Yang
Hong Kong University of Science and Technology
Statistical Machine Learning; Statistical Genetics and Genomics