UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

📅 2026-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Standard reinforcement learning often improves the single-attempt accuracy (pass@1) of large language models while suppressing response diversity across multiple attempts, limiting exploration. To address this trade-off, this work introduces mutual information skill learning (MISL) into large language model training for the first time, proposing a token-level mutual information reward integrated with Group Relative Policy Optimization (GRPO) to improve pass@k performance. The approach significantly boosts multi-attempt success rates without compromising pass@1 accuracy. Experiments on GSM8K show an average pass@k improvement of roughly 3% for models such as Llama 3.1-8B and Qwen 2.5-7B, confirming that the mutual information objective translates into tangible performance gains.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to its sampled skill variable z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
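The abstract's core idea, a token-level MI reward combined with GRPO's group-relative advantages, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the discriminator posteriors, the uniform skill prior, and the helper names (`mi_token_rewards`, `grpo_advantages`) are all assumptions made for the sketch. In MISL-style training, the MI reward at each token is `log q(z | prefix) - log p(z)`, which is positive whenever the response prefix identifies the sampled skill z better than chance.

```python
import numpy as np

NUM_SKILLS = 4  # K skills with a uniform prior p(z) = 1/K (assumed)

def mi_token_rewards(disc_probs, z):
    """Token-level MI reward: log q(z | prefix_t) - log p(z).

    disc_probs: (T, K) array of discriminator posteriors over skills,
    one row per token prefix of the response. Rewards are positive when
    the prefix reveals the sampled skill z better than chance.
    """
    log_prior = -np.log(NUM_SKILLS)
    return np.log(disc_probs[:, z] + 1e-8) - log_prior

def grpo_advantages(task_rewards):
    """GRPO-style advantage: normalize verifiable rewards within a group
    of responses sampled for the same prompt."""
    r = np.asarray(task_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: one 3-token response generated under skill z = 2.
probs = np.array([
    [0.25, 0.25, 0.25, 0.25],  # chance-level posterior -> reward ~ 0
    [0.10, 0.10, 0.70, 0.10],  # prefix starts to reveal the skill
    [0.05, 0.05, 0.85, 0.05],  # skill clearly identified
])
r_mi = mi_token_rewards(probs, z=2)

# Group of 4 responses with binary verifiable rewards (correct/incorrect).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In the paper's setup the MI term is added to the verifiable task reward, so responses are pushed both toward correctness and toward skill-specific (hence mutually diverse) solution strategies, which is what lifts pass@k without hurting pass@1.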
Problem

Research questions and friction points this paper is trying to address.

response diversity
large language models
reinforcement learning
reasoning tasks
multi-attempt correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mutual Information Skill Learning
pass@k optimization
response diversity
Group Relative Policy Optimization
token-level reward
Devan Shah — Princeton University
Owen Yang — Princeton University
Daniel Yang — Princeton University
Chongyi Zheng — Princeton University (Reinforcement Learning, Machine Learning)
Benjamin Eysenbach — Princeton University (Reinforcement Learning)