Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) methods for enhancing large language models' (LLMs) complex reasoning face an exploration dilemma: pretrained policies are sharply peaked, so standard RL overfits to known solutions, improving pass@1 while degrading pass@k and solution diversity. Method: We propose a risk-sensitive RL framework whose risk-seeking objective interpolates between the expected and maximum reward, preserving single-solution accuracy while encouraging deep exploration. From this objective we derive Risk-Sensitive GRPO (RS-GRPO), which amplifies policy updates on challenging prompts under verifiable rewards. Results: Experiments across six mathematical reasoning benchmarks and five state-of-the-art LLMs show consistent pass@k improvement (+12.3% on average) with stable or improved pass@1. Crucially, the method discovers novel reasoning paths rather than merely distilling existing capabilities, a first for RL-based LLM reasoning enhancement.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.
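The abstract describes a risk-seeking objective that "interpolates between mean and maximum rewards" but does not give its exact form on this page. A standard construction in risk-sensitive RL with this interpolation property is the exponential-utility (log-sum-exp) objective, shown here as a plausible sketch (the symbols $\beta$, $R$, $x$, $y$ are illustrative, not taken from the paper):

```latex
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\left[\exp\bigl(\beta\, R(x,y)\bigr)\right],
\qquad
\lim_{\beta \to 0^{+}} J_\beta(\pi) = \mathbb{E}\bigl[R(x,y)\bigr],
\qquad
\lim_{\beta \to \infty} J_\beta(\pi) = \max_{y}\, R(x,y).
```

Small $\beta$ recovers the ordinary expected-reward objective; large $\beta$ weights the best sampled solutions most heavily, which matches the paper's stated goal of encouraging deep exploration without abandoning average accuracy.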
Problem

Research questions and friction points this paper is trying to address.

Addresses exploration limitations in LLM reinforcement learning
Enhances solution diversity while maintaining single-answer accuracy
Overcomes the narrow solution sets that sharply peaked pretrained policies impose on standard RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-seeking objective interpolates mean and maximum rewards
RS-GRPO algorithm amplifies learning from challenging prompts
Simple implementation requiring minor code modifications
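The "minor code modifications" claim suggests the change fits inside GRPO's group-relative advantage computation. The paper's exact update rule is not given on this page; the sketch below is a hypothetical illustration assuming an exponential-utility baseline in place of the group mean (the function names, `beta` parameter, and baseline choice are all assumptions, not the authors' code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantages: each sampled completion's reward,
    normalized against the group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def risk_sensitive_advantages(rewards, beta=1.0, eps=1e-8):
    """Hypothetical risk-seeking variant: swap the arithmetic-mean baseline
    for an exponential-utility (log-sum-exp) baseline, which interpolates
    between the group mean (beta -> 0) and the group max (beta -> inf).
    On hard prompts with few correct samples, the baseline sits above the
    mean, so rare successes receive relatively larger updates."""
    r = np.asarray(rewards, dtype=float)
    # (1/beta) * log E[exp(beta * r)]: risk-seeking certainty equivalent
    baseline = np.log(np.mean(np.exp(beta * r))) / beta
    return (r - baseline) / (r.std() + eps)
```

With binary verifiable rewards such as `[0, 0, 1, 1]`, the risk-sensitive baseline lies between the mean (0.5) and the max (1.0), and approaches the mean as `beta` shrinks, illustrating the claimed mean-to-max interpolation in a few lines of code.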
Authors

Yuhua Jiang, Tsinghua University
Jiawei Huang, ETH Zurich
Yufeng Yuan, ByteDance Seed
Xin Mao, ByteDance Seed
Yu Yue, ByteDance Seed
Qianchuan Zhao, Center for Intelligent and Networked Systems, Dept. of Automation, Tsinghua University, Beijing, China
Lin Yan, ByteDance Seed