Tailored Primitive Initialization is the Secret Key to Reinforcement Learning

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for enhancing large language model (LLM) reasoning suffers from low sampling efficiency and high sensitivity to initialization—leading to substantial performance variance across models under identical RL training. To address this, we propose Tailor, a framework that automatically discovers and constructs diverse, high-quality reasoning primitives via token-level coverage analysis of reasoning traces. Tailor establishes an RL-oriented, customized supervised fine-tuning and data filtering pipeline to enable robust RL warm-starting. Its core innovation lies in explicitly modeling reasoning structure as a learnable, composable set of primitives, thereby significantly broadening the coverage of the initial reasoning state distribution. Evaluated on mathematical and logical reasoning benchmarks, Tailor-generated initialization data accelerates RL convergence and improves stability, yielding an average 23.6% gain in final reasoning accuracy. This effectively mitigates both initialization dependency and sample inefficiency in RL-based reasoning optimization.
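The summary does not specify how the token-level coverage analysis is computed. As an illustration only, one plausible way to operationalize "reasoning token coverage" is to treat each reasoning trace as a set of token n-grams and greedily select warm-start traces that maximize marginal coverage of the n-gram space. The function names and the n-gram proxy below are assumptions for this sketch, not Tailor's actual algorithm.

```python
def token_ngrams(tokens, n=2):
    """All n-grams of a tokenized trace, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def greedy_coverage_select(traces, k, n=2):
    """Greedily pick k traces maximizing marginal n-gram coverage.

    `traces` is a list of tokenized reasoning traces (lists of strings).
    Returns (indices of selected traces, set of covered n-grams).
    This is an illustrative proxy for coverage-driven data curation.
    """
    remaining = set(range(len(traces)))
    covered = set()
    selected = []
    for _ in range(min(k, len(traces))):
        # Pick the trace contributing the most n-grams not yet covered.
        best, best_gain = None, -1
        for i in sorted(remaining):
            gain = len(token_ngrams(traces[i], n) - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered |= token_ngrams(traces[best], n)
        remaining.remove(best)
    return selected, covered
```

Under this proxy, a warm-start set with broader n-gram coverage exposes RL to a wider initial reasoning-state distribution, which is the property the paper argues drives stable, sample-efficient training.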

📝 Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). While RL has demonstrated substantial performance gains, it still faces key challenges, including low sampling efficiency and a strong dependence on model initialization: some models achieve rapid improvements with minimal RL steps, while others require significant training data to make progress. In this work, we investigate these challenges through the lens of reasoning token coverage and argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives, thereby expanding the coverage of reasoning-state distributions before RL. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that Tailor generates more diverse and higher-quality warm-start data, resulting in higher downstream RL performance.
Problem

Research questions and friction points this paper is trying to address.

Reinforcement learning faces low sampling efficiency in language models
Model initialization strongly affects reasoning performance in RL
Limited reasoning token coverage hinders stable RL training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically discovers novel reasoning primitives
Curates high-quality warm-start data
Expands reasoning-state distributions before RL