Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard GRPO often suffers from optimization imbalance in multi-task settings: some tasks dominate optimization while weaker tasks stagnate, hindering reliable cross-task generalization. To address this, the paper proposes Multi-Task GRPO (MT-GRPO), which introduces a dynamic task-weighting mechanism that explicitly prioritizes optimization of the worst-performing task, together with a ratio-preserving sampler that ensures gradient updates accurately reflect the adapted task weights. Empirically, MT-GRPO improves worst-task accuracy by 16–28% over standard GRPO and by 6% over DAPO in both 3-task and 9-task configurations, while maintaining competitive average accuracy. It also halves the training steps needed to reach 50% worst-task accuracy in the 3-task setting, substantially improving both multi-task balance and training efficiency.

📝 Abstract
RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
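The two mechanisms in the abstract can be illustrated with a minimal sketch. The paper does not publish its exact formulas here, so the exponential weighting rule, the temperature parameter, and the function names below are illustrative assumptions; only the overall idea (weight the weakest task most heavily, then allocate batch slots so per-task sample counts match those weights) comes from the abstract.

```python
import math

def worst_task_weights(task_acc, temperature=0.5):
    """Hypothetical dynamic task weighting: tasks with lower measured
    accuracy receive exponentially larger weight, so the worst task is
    prioritized. The exact rule in the paper may differ."""
    scores = [math.exp((1.0 - acc) / temperature) for acc in task_acc]
    total = sum(scores)
    return [s / total for s in scores]

def ratio_preserving_allocation(weights, batch_size):
    """Allocate integer per-task prompt counts whose proportions track
    the adapted weights as closely as possible (largest-remainder
    rounding), so task-wise gradient contributions reflect the weights."""
    raw = [w * batch_size for w in weights]
    counts = [int(r) for r in raw]
    leftover = batch_size - sum(counts)
    # Hand remaining slots to the tasks with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Example: three tasks, the third is weakest, batch of 64 prompts.
weights = worst_task_weights([0.9, 0.7, 0.4])
counts = ratio_preserving_allocation(weights, 64)
```

In this sketch the weakest task ends up with the largest share of the batch, and the integer counts always sum exactly to the batch size, which is the property a ratio-preserving sampler needs so that no task's effective gradient contribution drifts from its assigned weight.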
Problem

Research questions and friction points this paper is trying to address.

multi-task learning
GRPO
large language models
optimization imbalance
zero advantage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Task GRPO
worst-task optimization
ratio-preserving sampler
balanced multi-task learning
reliable LLM reasoning