EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse rewards in reinforcement learning impede effective exploration—particularly on challenging problems where low rollout accuracy hinders learning. Existing approaches rely on stronger LLM distillation or filtering of difficult instances, compromising scalability and reasoning capability gains. Method: We propose EvoCoT, a two-stage chain-of-thought optimization framework for self-evolving curriculum learning. It constrains the exploration space via self-generated, verifiable reasoning trajectories and progressively compresses solution paths to enable controllable exploration expansion—achieving fully autonomous, annotation-free curriculum design. Contribution/Results: EvoCoT is compatible with mainstream RL fine-tuning methods and supports major open-weight models (e.g., Qwen, DeepSeek, Llama). Experiments demonstrate substantial improvements in mathematical and logical reasoning performance, enabling large language models to solve previously intractable complex problems while preserving scalability and generalization.

📝 Abstract
Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
Problem

Research questions and friction points this paper is trying to address.

Overcoming sparse rewards in RLVR for LLMs
Addressing exploration bottlenecks in reasoning tasks
Enhancing scalability without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving curriculum learning framework
Two-stage chain-of-thought reasoning optimization
Controlled expansion of exploration space
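
To make the two-stage idea concrete, here is a minimal sketch of an EvoCoT-style curriculum loop. All names (`generate_cot`, `solve_with_hint`, the rollout budget of 8) are illustrative assumptions, not the paper's actual API; the key mechanics from the abstract are (1) self-generating and verifying a CoT trajectory, then (2) progressively shortening it so the model must explore a growing portion of the solution on its own.

```python
# Illustrative sketch of EvoCoT-style self-evolving curriculum learning.
# The callables `generate_cot` and `solve_with_hint` are hypothetical
# stand-ins for the policy LLM; the verifier is a simple exact-match check.

def shorten(cot_steps, keep_ratio):
    """Keep only the leading `keep_ratio` fraction of a CoT trajectory."""
    k = max(1, int(len(cot_steps) * keep_ratio))
    return cot_steps[:k]

def evocot_curriculum(problem, answer, generate_cot, solve_with_hint,
                      num_stages=4, rollout_budget=8):
    """Stage 1: self-generate a verifiable CoT for a hard problem.
    Stage 2: gradually shorten it to expand exploration in a controlled way."""
    # Stage 1: sample rollouts until one verifies against the known answer.
    trajectory = None
    for _ in range(rollout_budget):
        cot, prediction = generate_cot(problem)
        if prediction == answer:      # verifiable reward (exact match)
            trajectory = cot
            break
    if trajectory is None:
        return []                     # problem remains unsolved this round

    # Stage 2: curriculum from full hint down to no hint at all.
    training_examples = []
    for stage in range(num_stages, -1, -1):
        hint = shorten(trajectory, stage / num_stages) if stage else []
        solved = solve_with_hint(problem, hint, answer)
        training_examples.append(
            {"problem": problem, "hint": hint, "solved": solved})
    return training_examples
```

With `num_stages=4` this yields five curriculum steps per solved problem, from the full self-generated trajectory down to an empty hint, which is where the paper's claim of stable learning under sparse rewards applies.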
👥 Authors
Huanyu Liu (School of Computer Science, Peking University)
Jia Li (College of AI, Tsinghua University)
Chang Yu (School of Computer Science, Peking University)
Taozhi Chen (Imperial College London)
Yihong Dong (Peking University)
Lecheng Wang (School of Computer Science, Peking University)
Hu XiaoLong (New H3C Technologies Co., Ltd.)
Ge Li (Full Professor of Computer Science, Peking University)