Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In reinforcement learning-based self-improvement of large language models (LLMs), giving every task the same exploration budget wastes rollouts on easy tasks that always succeed and starves hard tasks that always fail. Both cases yield zero gradients under GRPO. This work formulates task-level exploration as a knapsack problem, treating each task's exploration as an item with a value and a cost, and derives an allocation rule that adapts to the model's current learning status. Applied to GRPO, the method increases the effective ratio of non-zero policy gradients by 20-40% during training. On mathematical reasoning benchmarks it improves average performance by 2-4 points (up to +9 on specific tasks), and matching these results with uniform allocation would require about 2x the compute.

📝 Abstract

Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
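
To make the zero-gradient edge case concrete, here is a minimal sketch. It assumes binary (pass/fail) rewards and independent rollouts, which this page does not state explicitly. Under those assumptions, a task with per-rollout success rate p produces a group of n identical rewards with probability p^n + (1 - p)^n. The function name and the uniform budget of 8 rollouts are illustrative choices; the 93-rollout figure is the one quoted in the abstract.

```python
def zero_gradient_prob(p: float, n: int) -> float:
    """Probability that all n rollouts receive the same binary reward.

    Assumes independent rollouts that each succeed with probability p.
    """
    return p ** n + (1.0 - p) ** n


if __name__ == "__main__":
    hard_task_success_rate = 0.01  # hypothetical very hard task
    for budget in (8, 93):  # 8 is an assumed uniform budget; 93 is from the abstract
        prob = zero_gradient_prob(hard_task_success_rate, budget)
        print(f"rollouts={budget:3d}  P(all rewards identical)={prob:.3f}")
```

For a hard task, the chance of a wasted group drops sharply as the rollout count grows, which is why shifting budget toward such tasks can pay off.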

Problem

Research questions and friction points this paper is trying to address.

Optimizing exploration budget allocation for LLM self-improvement

Addressing zero-gradient issues in Group Relative Policy Optimization

Adaptively distributing computational resources to challenging tasks
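
The zero-gradient issue listed above follows from how group-relative advantages are computed. The sketch below assumes the common formulation in which each reward is centered by the group mean and scaled by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration, not details taken from this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group std.

    eps avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Easy task: every rollout succeeds, so every advantage is zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
    # Hard task: every rollout fails, same outcome.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # Mixed outcomes: advantages are non-zero and carry a learning signal.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

When all advantages in a group are zero, that group contributes nothing to the policy gradient, so the rollouts spent on it are effectively wasted.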

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes budget allocation via knapsack problem formulation

Adaptively distributes resources based on learning status

Increases the effective ratio of non-zero policy gradients by 20-40% during training
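
The allocation idea above can be sketched as a greedy budget assignment. This is an illustration under stated assumptions, not the paper's actual rule: the "value" of a task is taken to be the probability that its rollout group yields a non-zero gradient (binary rewards, independent rollouts), the "cost" is one unit per rollout, and `allocate_budget`, `min_rollouts`, and the example success rates are hypothetical.

```python
import heapq


def nonzero_gradient_prob(p: float, n: int) -> float:
    """Assumed task value: chance that n binary rollouts are not all identical."""
    return 1.0 - p ** n - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Value gained by adding one more rollout to a task that already has n."""
    return nonzero_gradient_prob(p, n + 1) - nonzero_gradient_prob(p, n)


def allocate_budget(success_rates, total_budget, min_rollouts=2):
    """Greedily hand out rollouts to whichever task gains the most from one more."""
    alloc = [min_rollouts] * len(success_rates)
    remaining = total_budget - sum(alloc)
    if remaining < 0:
        raise ValueError("total_budget is too small for min_rollouts per task")

    # Max-heap via negated gains; each entry is (-gain, task_index).
    heap = [(-marginal_gain(p, min_rollouts), i) for i, p in enumerate(success_rates)]
    heapq.heapify(heap)

    while remaining > 0 and heap:
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0.0:
            break  # no task benefits from more rollouts; leave the rest unspent
        alloc[i] += 1
        remaining -= 1
        heapq.heappush(heap, (-marginal_gain(success_rates[i], alloc[i]), i))
    return alloc


if __name__ == "__main__":
    # Hypothetical estimated success rates: medium, very hard, saturated.
    print(allocate_budget([0.5, 0.01, 1.0], total_budget=24))
```

In this toy setup the saturated task keeps only its minimum rollouts, while the very hard task absorbs most of the freed budget, mirroring the reallocation described in the abstract.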

🔎 Similar Papers

EVOLvE: Evaluating and Optimizing LLMs For Exploration