EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse rewards in reinforcement learning impede effective exploration—particularly on challenging problems where low rollout accuracy hinders learning. Existing approaches rely on stronger LLM distillation or filtering of difficult instances, compromising scalability and reasoning capability gains. Method: We propose EvoCoT, a two-stage chain-of-thought optimization framework for self-evolving curriculum learning. It constrains the exploration space via self-generated, verifiable reasoning trajectories and progressively compresses solution paths to enable controllable exploration expansion—achieving fully autonomous, annotation-free curriculum design. Contribution/Results: EvoCoT is compatible with mainstream RL fine-tuning methods and supports major open-weight models (e.g., Qwen, DeepSeek, Llama). Experiments demonstrate substantial improvements in mathematical and logical reasoning performance, enabling large language models to solve previously intractable complex problems while preserving scalability and generalization.

📝 Abstract
Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
Problem

Research questions and friction points this paper is trying to address.

Overcoming sparse rewards in RLVR for LLMs
Addressing exploration bottlenecks in reasoning tasks
Enhancing scalability without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving curriculum learning framework
Two-stage chain-of-thought reasoning optimization
Controlled expansion of exploration space
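
To make the two-stage idea concrete, here is a minimal sketch of an EvoCoT-style curriculum loop. All names (`generate_cot`, `solve_with_hint`, the rollout budget of 8) are illustrative assumptions, not the paper's actual API; the key mechanics from the abstract are (1) self-generating and verifying a CoT trajectory, then (2) progressively shortening it so the model must explore a growing portion of the solution on its own.

```python
# Illustrative sketch of EvoCoT-style self-evolving curriculum learning.
# The callables `generate_cot` and `solve_with_hint` are hypothetical
# stand-ins for the policy LLM; the verifier is a simple exact-match check.

def shorten(cot_steps, keep_ratio):
    """Keep only the leading `keep_ratio` fraction of a CoT trajectory."""
    k = max(1, int(len(cot_steps) * keep_ratio))
    return cot_steps[:k]

def evocot_curriculum(problem, answer, generate_cot, solve_with_hint,
                      num_stages=4, rollout_budget=8):
    """Stage 1: self-generate a verifiable CoT for a hard problem.
    Stage 2: gradually shorten it to expand exploration in a controlled way."""
    # Stage 1: sample rollouts until one verifies against the known answer.
    trajectory = None
    for _ in range(rollout_budget):
        cot, prediction = generate_cot(problem)
        if prediction == answer:      # verifiable reward (exact match)
            trajectory = cot
            break
    if trajectory is None:
        return []                     # problem remains unsolved this round

    # Stage 2: curriculum from full hint down to no hint at all.
    training_examples = []
    for stage in range(num_stages, -1, -1):
        hint = shorten(trajectory, stage / num_stages) if stage else []
        solved = solve_with_hint(problem, hint, answer)
        training_examples.append(
            {"problem": problem, "hint": hint, "solved": solved})
    return training_examples
```

With `num_stages=4` this yields five curriculum steps per solved problem, from the full self-generated trajectory down to an empty hint, which is where the paper's claim of stable learning under sparse rewards applies.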
👥 Authors
Huanyu Liu (School of Computer Science, Peking University)
Jia Li (College of AI, Tsinghua University)
Chang Yu (School of Computer Science, Peking University)
Taozhi Chen (Imperial College London)
Yihong Dong (Peking University)
Lecheng Wang (School of Computer Science, Peking University)
Hu XiaoLong (New H3C Technologies Co., Ltd.)
Ge Li (Full Professor of Computer Science, Peking University)