Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing zero-shot reinforcement learning (RL) for large language model (LLM) reasoning relies on static prompts, resulting in low sampling efficiency, excessive invalid outputs, and severe sample waste, particularly for weaker models. Method: We propose Cog-Rethinker, a reasoning-oriented hierarchical metacognitive RL framework that emulates human problem-solving by automatically decomposing problems and iteratively refining answers in a rollout-driven, error-guided manner. Its core design integrates metacognitive mechanisms into hierarchical RL, combines supervised fine-tuning with verifier-based reward modeling, and introduces a two-stage inference pipeline to ensure robust cold-start capability and training-inference consistency. Results: Evaluated across multiple mathematical reasoning benchmarks, Cog-Rethinker significantly improves sample efficiency and convergence speed, and substantially enhances the zero-shot RL reasoning performance of weaker LLMs.

📝 Abstract
Contemporary progress in large language models (LLMs) has revealed notable inferential capacities via reinforcement learning (RL) with verifiable rewards, enabling the development of O1- and R1-like reasoning models. Training directly from base models with RL is called zero-RL. However, previous works rely on activating LLMs' inherent capacities through fixed prompt templates. This strategy introduces substantial sampling inefficiency for weak LLMs: in reasoning tasks, the majority of problems yield invalid outputs during accuracy-driven filtration, wasting samples. To solve this issue, we propose Cog-Rethinker, a novel hierarchical metacognitive RL framework for LLM reasoning. Cog-Rethinker mainly targets the rollout procedure in RL training. After the direct rollout, it improves sample utilization through a hierarchical metacognitive two-stage framework modeled on human problem-solving. First, it prompts the policy to decompose zero-accuracy problems into subproblems and produce final reasoning results from them. Second, for problems that remain at zero accuracy after the previous rollout stage, it further prompts the policy to refine its answers by referencing the earlier wrong solutions. Moreover, to cold-start the two new reasoning patterns and maintain train-test consistency across prompt templates, Cog-Rethinker applies supervised fine-tuning on the policy using correct samples from the two stages paired with the direct-rollout template. Experimental results demonstrate Cog-Rethinker's superior performance on various mathematical reasoning benchmarks; we also analyze its improved sample efficiency, which accelerates convergence compared to baseline methods.
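The two-stage rollout described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `policy`, `verify`, the prompt wordings, and the helper `sample_answers` are all hypothetical stand-ins for the actual policy model, reward verifier, and templates.

```python
def sample_answers(policy, prompt, n=4):
    # Stand-in for sampling n rollouts from the policy model on one prompt.
    return [policy(prompt) for _ in range(n)]

def two_stage_rollout(policy, problems, verify, n=4):
    """Sketch of Cog-Rethinker's rollout: direct sampling, then a
    decomposition pass and a refinement pass over zero-accuracy problems.
    Correct samples from the extra stages are collected for supervised
    fine-tuning under the direct-rollout template."""
    sft_pool = []
    for prob in problems:
        # Stage 0: direct rollout with the plain template.
        answers = sample_answers(policy, prob, n)
        if any(verify(prob, a) for a in answers):
            continue  # at least one correct sample: normal RL update path

        # Stage 1: prompt the policy to decompose the problem (hypothetical wording).
        decomp_prompt = f"Break into subproblems, then solve: {prob}"
        answers = sample_answers(policy, decomp_prompt, n)
        correct = [a for a in answers if verify(prob, a)]

        # Stage 2: still zero accuracy -> refine by referencing a wrong solution.
        if not correct:
            wrong = answers[0]
            refine_prompt = f"A previous attempt was wrong: {wrong}. Retry: {prob}"
            answers = sample_answers(policy, refine_prompt, n)
            correct = [a for a in answers if verify(prob, a)]

        # Keep correct samples paired with the *direct* problem statement for SFT,
        # preserving train-test prompt consistency.
        sft_pool.extend((prob, a) for a in correct)
    return sft_pool
```

In this sketch, problems already solved in the direct rollout skip the extra stages entirely, which is how the method concentrates compute on the zero-accuracy samples that would otherwise be wasted.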
Problem

Research questions and friction points this paper is trying to address.

Improves sample efficiency in reinforcement learning for LLM reasoning tasks
Addresses invalid output generation during accuracy-driven filtration in weak LLMs
Enables hierarchical metacognitive reasoning through problem decomposition and answer refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical metacognitive RL framework for reasoning
Two-stage decomposition and refinement of problems
Supervised fine-tuning for cold-start and consistency
🔎 Similar Papers
Zexu Sun — Renmin University of China — Causal inference, Reinforcement learning, Large language model
Yongcheng Zeng — University of Chinese Academy of Sciences — LLM, Reinforcement Learning
Erxue Min — University of Manchester, Baidu Inc. — Information Retrieval, Large Language Model
Heyang Gao — Gaoling School of Artificial Intelligence, Renmin University of China
Bokai Ji — Baidu Inc.
Xu Chen — Gaoling School of Artificial Intelligence, Renmin University of China