🤖 AI Summary
Standard reinforcement learning often improves the single-attempt accuracy (pass@1) of large language models while simultaneously suppressing response diversity across multiple attempts, thereby limiting exploration. To address this trade-off, this work introduces mutual information skill learning (MISL) into large language model training for the first time, proposing a token-level mutual information reward integrated with Group Relative Policy Optimization (GRPO) to improve pass@k performance. The approach significantly boosts multi-attempt success rates without compromising pass@1 accuracy. Experiments on GSM8K show an average pass@k improvement of roughly 3% for models such as Llama 3.1-8B and Qwen 2.5-7B, validating that the mutual information objective aligns with measurable performance gains.
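Since the summary's claims hinge on the pass@k metric, it may help to recall how it is typically computed. The sketch below uses the standard unbiased estimator (from the HumanEval/Codex evaluation literature, not defined in this summary): given n sampled attempts of which c are correct, it estimates the probability that at least one of k attempts succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per problem
    c: number of correct attempts among them
    k: attempts budget being evaluated

    Returns the probability that at least one of k attempts
    drawn without replacement from the n samples is correct:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that pass@1 is simply the fraction of correct samples (c / n), so a method can raise pass@k while leaving pass@1 unchanged only by spreading correctness across more diverse attempts.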
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel token-level mutual information (MI) reward, implemented within Group Relative Policy Optimization (GRPO), that encourages each trajectory to be specific to its sampled skill variable z. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
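The abstract describes a token-level MI reward combined with GRPO's group-relative advantages. The paper's exact formulation is not given here, so the following is only a minimal sketch under two common assumptions: the MI reward takes the DIAYN/MISL form r_t = log q(z | prefix_t) - log p(z) for a hypothetical skill discriminator q, and GRPO advantages are computed by normalizing each trajectory's reward against its group's mean and standard deviation.

```python
import numpy as np

def mi_token_rewards(logq_z_given_prefix: np.ndarray, log_p_z: float) -> np.ndarray:
    """Token-level MI reward in the DIAYN/MISL style (assumed form):
        r_t = log q(z | tokens up to t) - log p(z)
    `logq_z_given_prefix[t]` is a hypothetical discriminator's
    log-probability of the sampled skill z given the prefix ending at t.
    The reward is high when the prefix already identifies the skill,
    i.e. when the trajectory is specific to z.
    """
    return logq_z_given_prefix - log_p_z

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each
    trajectory's scalar reward by the group's mean and std."""
    mu = group_rewards.mean()
    sigma = group_rewards.std()
    return (group_rewards - mu) / (sigma + 1e-8)
```

In a full implementation the verifiable task reward and the MI term would be combined (e.g. summed with a weighting coefficient) before the group normalization; the names and decomposition above are illustrative, not the authors' API.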