Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
General-purpose reasoning models face significant challenges in reinforcement learning (RL) training due to cross-domain heterogeneity—particularly volatile response lengths and verification latency—which causes training instability, complicates curriculum design, and heightens sensitivity to hyperparameters. Method: We propose a domain-sequential RL training paradigm that decouples heterogeneous tasks into staged, reusable intra-domain RLVR (Reinforcement Learning with Verifiable Rewards) pipelines. We introduce a novel cascaded RL architecture and incorporate RLHF pre-alignment—not merely for preference optimization but to explicitly enhance reasoning capabilities. Our approach integrates multi-stage curriculum learning with transparent data curation and training recipes. Contribution/Results: The resulting 14B model surpasses DeepSeek-R1-0528 on LiveCodeBench v5/v6/Pro and achieves silver-medal performance at the 2025 International Olympiad in Informatics (IOI), demonstrating strong effectiveness and generalization on complex reasoning tasks.

📝 Abstract
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
Problem

Research questions and friction points this paper is trying to address.

Addresses cross-domain heterogeneity in RL for reasoning models
Reduces engineering complexity via sequential domain-wise RL
Enhances reasoning beyond preference optimization with RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded domain-wise RL for heterogeneous reasoning tasks
Sequential RL stages maintain or improve benchmark performance
RLHF pre-step enhances reasoning beyond preference optimization
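The staging idea above—an RLHF alignment pre-step followed by sequential, domain-wise RLVR stages, each initialized from the previous stage's checkpoint—can be sketched as a simple training driver. This is a minimal illustration of the cascade structure only; `run_stage`, the schedule, and all names are hypothetical and not the paper's actual API, and the per-stage RL updates (e.g., with domain-specific verifiers) are stubbed out.

```python
# Hedged sketch of cascaded domain-wise RL (Cascade RL). All identifiers are
# illustrative; a real stage would run policy-gradient updates against a
# domain-specific verifier rather than just recording the stage.

def run_stage(policy, domain, steps):
    """Run one intra-domain RL stage, starting from the incoming checkpoint.

    Stub: records which domain was trained and for how many steps.
    """
    policy["history"].append((domain, steps))
    return policy  # the updated checkpoint seeds the next stage

def cascade_rl(base_policy, schedule):
    """Execute RL stages sequentially instead of blending domain prompts."""
    policy = base_policy
    for domain, steps in schedule:
        policy = run_stage(policy, domain, steps)
    return policy

# Illustrative schedule: RLHF alignment first (which, per the paper, also
# boosts reasoning), then domain-wise RLVR stages. Step counts are made up.
schedule = [
    ("rlhf_alignment", 1000),
    ("math_rlvr", 2000),
    ("code_rlvr", 2000),
    ("science_rlvr", 1000),
]
policy = cascade_rl({"history": []}, schedule)
```

The key design choice sketched here is that each stage consumes a single domain's prompts, so response-length and verification-latency characteristics are homogeneous within a stage, which is what simplifies curriculum and hyperparameter choices relative to mixed-domain RL.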