FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

📅 2026-01-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in personalizing large language models under federated learning: reward heterogeneity, multi-objective imbalance, and high training costs. Conventional reinforcement-learning-based alignment methods are ill-suited to on-device deployment because they rely on a standalone critic network. To overcome this, the authors propose FedMOA, a framework that, for the first time, extends critic-free Group Relative Policy Optimization (GRPO) to federated multi-objective alignment. FedMOA introduces hypergradient-based online adaptive weighting of objectives on the client side to stabilize local training, while the server employs a task- and accuracy-aware aggregation strategy that prioritizes high-quality updates. Experiments demonstrate that FedMOA improves accuracy by up to 2.2% over standard federated averaging on mathematical reasoning and code generation tasks, enhancing global performance, personalization capability, and multi-objective balance.
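The server-side idea can be illustrated with a minimal sketch of accuracy-weighted aggregation. The function name, softmax weighting, and `temperature` parameter are assumptions for illustration; the paper's actual task- and accuracy-aware scheme may differ in its details.

```python
import numpy as np

def accuracy_aware_aggregate(client_updates, client_accuracies, temperature=0.1):
    """Aggregate client model updates, weighting each client by its local
    evaluation accuracy instead of the uniform weights of plain FedAvg.

    client_updates: list of dicts mapping parameter name -> np.ndarray
    client_accuracies: list of floats in [0, 1], one per client
    """
    accs = np.asarray(client_accuracies, dtype=np.float64)
    # Softmax over accuracies: higher-accuracy clients dominate the average.
    logits = accs / temperature
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    aggregated = {
        name: sum(w * upd[name] for w, upd in zip(weights, client_updates))
        for name in client_updates[0]
    }
    return aggregated, weights
```

As the temperature grows large, the weights become uniform and the scheme reduces to standard federated averaging; a small temperature concentrates the update on the highest-accuracy clients.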

๐Ÿ“ Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO's critic-free architecture makes on-device training feasible, yet moving to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training with an online adaptive weighting mechanism based on hypergradient descent, which shifts weight toward the primary reasoning objective as auxiliary objectives saturate. On the server side, a task- and accuracy-aware aggregation strategy prioritizes high-quality updates. Experiments on mathematical reasoning and code generation benchmarks show that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
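The client-side weighting can be sketched as follows. For a scalarized loss L = Σᵢ wᵢ Lᵢ, the hypergradient ∂L/∂wᵢ is Lᵢ itself, so a natural proxy is to adjust each weight by how much its objective is still improving: saturated auxiliary objectives decay while a still-improving primary objective gains weight. The class name, the exponentiated-gradient update, and the improvement proxy are illustrative assumptions, not FedMOA's actual algorithm.

```python
import numpy as np

class HypergradientWeights:
    """Online adaptive weights for K objectives via a hypergradient-style update.

    Each call to step() receives the current per-objective losses and returns
    weights on the probability simplex. Objectives that keep improving are
    up-weighted; objectives whose losses have saturated lose weight.
    """
    def __init__(self, num_objectives, lr=0.01):
        self.w = np.ones(num_objectives) / num_objectives
        self.lr = lr
        self.prev_losses = None

    def step(self, losses):
        losses = np.asarray(losses, dtype=np.float64)
        if self.prev_losses is not None:
            # Hypergradient proxy: per-objective improvement since last step.
            improvement = self.prev_losses - losses
            # Exponentiated-gradient update, then project back onto the simplex.
            self.w *= np.exp(self.lr * improvement)
            self.w /= self.w.sum()
        self.prev_losses = losses
        return self.w
```

Under this sketch, if the primary reasoning loss keeps decreasing while auxiliary losses plateau, the weight mass drifts toward the primary objective, matching the prioritization behavior described in the abstract.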
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Heterogeneous Rewards
Personalized Reasoning
Multi-objective Alignment
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Group Relative Policy Optimization
Multi-objective Alignment
Heterogeneous Rewards
Personalized LLMs