Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

2026-05-07

AI Summary
Existing unified multitask reinforcement learning approaches employ a fixed curriculum and a shared policy across all programming tasks, struggling to account for the distinct characteristics of individual tasks. To address this limitation, this work proposes the ASTOR framework, which introduces task utility signals to dynamically assess each task's learning potential and cross-task synergies. ASTOR employs a hierarchical utility routing mechanism to schedule training data at multiple levels and adaptively adjusts KL regularization constraints to optimize policy updates. Experiments on two prominent code large language models across four programming tasks demonstrate that a single ASTOR model consistently outperforms both task-specific models (by 9.0%–9.5%) and the strongest multitask baseline (by 7.5%–12.8%).
Abstract
Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, which ultimately limits the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a Hierarchical Utility-Routed Data Scheduling module that hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data; and 2) an Adaptive Utility-Calibrated Policy Optimization module that dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.
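The two modules described above can be illustrated with a rough sketch. This is not the paper's implementation: the abstract does not give ASTOR's utility formula, its scheduling rule, or its KL scaling rule. Every function name, the task names, the softmax allocation, the "pass rate closest to 0.5" informativeness proxy, and the choice to loosen KL as utility rises are assumptions made only to show the general shape of utility-driven scheduling and per-task KL calibration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a dict of task -> score."""
    top = max(scores.values())
    exps = {t: math.exp((s - top) / temperature) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def allocate_budget(task_utility, batch_size, temperature=1.0):
    """Task level (assumed rule): split the batch in proportion to
    softmax(utility), using largest-remainder rounding so the
    per-task counts sum exactly to batch_size."""
    probs = softmax(task_utility, temperature)
    raw = {t: p * batch_size for t, p in probs.items()}
    alloc = {t: int(r) for t, r in raw.items()}  # floor of each share
    leftover = batch_size - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda t: raw[t] - alloc[t], reverse=True)
    for t in by_remainder[:leftover]:
        alloc[t] += 1
    return alloc


def pick_prompts(prompt_pass_rates, k):
    """Prompt level (assumed proxy): treat prompts whose empirical pass
    rate is closest to 0.5 as the most informative and keep the top k."""
    ranked = sorted(prompt_pass_rates,
                    key=lambda p: abs(prompt_pass_rates[p] - 0.5))
    return ranked[:k]


def kl_coefficient(utility, base_beta=0.05, min_scale=0.5, max_scale=2.0):
    """Per-task KL weight (assumed direction): utility is clamped to
    [0, 1]; higher utility gives a smaller coefficient, i.e. a looser
    constraint on the policy update."""
    u = min(max(utility, 0.0), 1.0)
    scale = max_scale - (max_scale - min_scale) * u
    return base_beta * scale


# Hypothetical task names and utility values, for illustration only.
task_utility = {"codegen": 0.8, "repair": 0.2, "test_gen": 0.5, "reasoning": 0.5}
budget = allocate_budget(task_utility, batch_size=64)
betas = {t: kl_coefficient(u) for t, u in task_utility.items()}
```

In this sketch a single utility value per task drives both decisions, mirroring the abstract's description of two modules coupled through task utility; a real system would recompute the utilities as training progresses.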
Problem

Research questions and friction points this paper is trying to address.

multi-task reinforcement learning
code LLMs
task utility
data scheduling
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task reinforcement learning
task utility
data scheduling
adaptive KL regularization
code LLMs