🤖 AI Summary
Standard reinforcement learning often improves the single-attempt accuracy (pass@1) of large language models while simultaneously suppressing response diversity across multiple attempts, thereby limiting exploration. To address this trade-off, this work introduces mutual information skill learning (MISL) into large language model training for the first time, proposing a token-level mutual information reward integrated with Group Relative Policy Optimization (GRPO) to improve pass@k performance. The approach significantly boosts multi-attempt success rates without compromising pass@1 accuracy. Experiments on GSM8K show an average pass@k improvement of roughly 3% for models such as Llama 3.1-8B and Qwen 2.5-7B, validating that the mutual information objective aligns with measurable performance gains.
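Since the summary's claims hinge on the pass@k metric, it may help to recall how it is typically computed. The sketch below uses the standard unbiased estimator (from the HumanEval/Codex evaluation literature, not defined in this summary): given n sampled attempts of which c are correct, it estimates the probability that at least one of k attempts succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per problem
    c: number of correct attempts among them
    k: attempts budget being evaluated

    Returns the probability that at least one of k attempts
    drawn without replacement from the n samples is correct:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that pass@1 is simply the fraction of correct samples (c / n), so a method can raise pass@k while leaving pass@1 unchanged only by spreading correctness across more diverse attempts.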
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel token-level mutual information (MI) reward, implemented within Group Relative Policy Optimization (GRPO), that encourages each trajectory to be specific to its sampled skill variable z. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
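The abstract describes a token-level MI reward combined with GRPO's group-relative advantages. The paper's exact formulation is not given here, so the following is only a minimal sketch under two common assumptions: the MI reward takes the DIAYN/MISL form r_t = log q(z | prefix_t) - log p(z) for a hypothetical skill discriminator q, and GRPO advantages are computed by normalizing each trajectory's reward against its group's mean and standard deviation.

```python
import numpy as np

def mi_token_rewards(logq_z_given_prefix: np.ndarray, log_p_z: float) -> np.ndarray:
    """Token-level MI reward in the DIAYN/MISL style (assumed form):
        r_t = log q(z | tokens up to t) - log p(z)
    `logq_z_given_prefix[t]` is a hypothetical discriminator's
    log-probability of the sampled skill z given the prefix ending at t.
    The reward is high when the prefix already identifies the skill,
    i.e. when the trajectory is specific to z.
    """
    return logq_z_given_prefix - log_p_z

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each
    trajectory's scalar reward by the group's mean and std."""
    mu = group_rewards.mean()
    sigma = group_rewards.std()
    return (group_rewards - mu) / (sigma + 1e-8)
```

In a full implementation the verifiable task reward and the MI term would be combined (e.g. summed with a weighting coefficient) before the group normalization; the names and decomposition above are illustrative, not the authors' API.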