FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

📅 2026-01-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in personalizing large language models under federated learning: reward heterogeneity, multi-objective imbalance, and high training costs. Conventional reinforcement-learning-based alignment methods are ill-suited to on-device deployment because they rely on a standalone critic network. To overcome this, the authors propose FedMOA, a framework that, for the first time, extends critic-free Group Relative Policy Optimization (GRPO) to federated multi-objective alignment. FedMOA introduces hypergradient-based online adaptive weighting of objectives on the client side to stabilize local training, while the server employs a task- and accuracy-aware aggregation strategy that prioritizes high-quality updates. Experiments demonstrate that FedMOA improves accuracy by up to 2.2% over standard federated averaging on mathematical reasoning and code generation tasks, enhancing global performance, personalization capability, and multi-objective balance.
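The server-side idea can be illustrated with a minimal sketch of accuracy-weighted aggregation. The function name, softmax weighting, and `temperature` parameter are assumptions for illustration; the paper's actual task- and accuracy-aware scheme may differ in its details.

```python
import numpy as np

def accuracy_aware_aggregate(client_updates, client_accuracies, temperature=0.1):
    """Aggregate client model updates, weighting each client by its local
    evaluation accuracy instead of the uniform weights of plain FedAvg.

    client_updates: list of dicts mapping parameter name -> np.ndarray
    client_accuracies: list of floats in [0, 1], one per client
    """
    accs = np.asarray(client_accuracies, dtype=np.float64)
    # Softmax over accuracies: higher-accuracy clients dominate the average.
    logits = accs / temperature
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    aggregated = {
        name: sum(w * upd[name] for w, upd in zip(weights, client_updates))
        for name in client_updates[0]
    }
    return aggregated, weights
```

As the temperature grows large, the weights become uniform and the scheme reduces to standard federated averaging; a small temperature concentrates the update on the highest-accuracy clients.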

๐Ÿ“ Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO's critic-free architecture makes on-device training feasible, yet moving to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training with an online adaptive weighting mechanism based on hypergradient descent, which shifts weight toward the primary reasoning objective as auxiliary objectives saturate. On the server side, a task- and accuracy-aware aggregation strategy prioritizes high-quality updates. Experiments on mathematical reasoning and code generation benchmarks show that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
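The client-side weighting can be sketched as follows. For a scalarized loss L = Σᵢ wᵢ Lᵢ, the hypergradient ∂L/∂wᵢ is Lᵢ itself, so a natural proxy is to adjust each weight by how much its objective is still improving: saturated auxiliary objectives decay while a still-improving primary objective gains weight. The class name, the exponentiated-gradient update, and the improvement proxy are illustrative assumptions, not FedMOA's actual algorithm.

```python
import numpy as np

class HypergradientWeights:
    """Online adaptive weights for K objectives via a hypergradient-style update.

    Each call to step() receives the current per-objective losses and returns
    weights on the probability simplex. Objectives that keep improving are
    up-weighted; objectives whose losses have saturated lose weight.
    """
    def __init__(self, num_objectives, lr=0.01):
        self.w = np.ones(num_objectives) / num_objectives
        self.lr = lr
        self.prev_losses = None

    def step(self, losses):
        losses = np.asarray(losses, dtype=np.float64)
        if self.prev_losses is not None:
            # Hypergradient proxy: per-objective improvement since last step.
            improvement = self.prev_losses - losses
            # Exponentiated-gradient update, then project back onto the simplex.
            self.w *= np.exp(self.lr * improvement)
            self.w /= self.w.sum()
        self.prev_losses = losses
        return self.w
```

Under this sketch, if the primary reasoning loss keeps decreasing while auxiliary losses plateau, the weight mass drifts toward the primary objective, matching the prioritization behavior described in the abstract.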
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Heterogeneous Rewards
Personalized Reasoning
Multi-objective Alignment
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Group Relative Policy Optimization
Multi-objective Alignment
Heterogeneous Rewards
Personalized LLMs