PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to federated reinforcement learning from verifiable rewards (RLVR): the cost of full-model synchronization and the client drift that many local steps cause under heterogeneous data. The authors propose a framework that combines LoRA-based local adaptation with off-policy steps on a small shared public dataset. Clients train locally on private data; during public-data steps they exchange response-level training signals and selectively replace locally incorrect responses with globally correct ones, which anchors training toward a more globally aligned objective without exposing private data. Across mathematical and medical reasoning benchmarks and models, the method consistently improves over standard baselines, supporting the combination of low-rank communication with limited public-data coordination as a recipe for federated reasoning post-training.
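To make the communication argument concrete, the sketch below compares the number of values a client would send for one weight matrix under full synchronization versus a rank-r LoRA update. The layer shape (4096 x 4096) and rank (16) are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
def full_matrix_params(d_out: int, d_in: int) -> int:
    """Values sent when the full d_out x d_in weight matrix is synchronized."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Values sent when only the LoRA factors B (d_out x r) and A (r x d_in) are synchronized."""
    return rank * (d_out + d_in)


# Illustrative shapes only (assumed, not taken from the paper).
d_out, d_in, rank = 4096, 4096, 16
full = full_matrix_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full}, lora: {lora}, reduction: {full // lora}x")
```

Under these assumed shapes the per-matrix payload shrinks by two orders of magnitude; the actual saving depends on the rank and on which layers carry adapters.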

📝 Abstract

Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.
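The selective replacement described in the abstract can be illustrated with a minimal sketch. It assumes each public prompt has a pool of responses that other clients produced and that were already verified as correct; only local responses that fail verification are swapped out, so correct local (on-policy) samples are kept. All names here (`Response`, `selectively_replace`, `shared_pool`, `verifier`) are hypothetical and chosen for illustration; the paper's actual procedure and its off-policy update rule may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Response:
    text: str
    correct: bool
    from_pool: bool = False  # True if the text came from the shared pool (off-policy)


def selectively_replace(
    prompt: str,
    local_texts: List[str],
    shared_pool: Dict[str, List[str]],
    verifier: Callable[[str, str], bool],
) -> List[Response]:
    """Build a training batch for one public prompt.

    Correct local responses are kept as-is. Incorrect ones are replaced by
    a pooled response when one is available (assumption: pooled responses
    were verified as correct before being shared). If the pool runs out,
    the incorrect local response stays in the batch.
    """
    pool = list(shared_pool.get(prompt, []))  # copy so the caller's pool is untouched
    batch: List[Response] = []
    for text in local_texts:
        if verifier(prompt, text):
            batch.append(Response(text, correct=True))
        elif pool:
            batch.append(Response(pool.pop(0), correct=True, from_pool=True))
        else:
            batch.append(Response(text, correct=False))
    return batch


# Toy usage with a hypothetical string-matching verifier.
toy_verifier = lambda prompt, text: "4" in text
toy_batch = selectively_replace(
    "2+2=?",
    ["4", "5", "3"],
    {"2+2=?": ["The answer is 4"]},
    toy_verifier,
)
```

In this toy run the first local response is kept, the second is replaced by the pooled response, and the third stays incorrect because the pool is exhausted. Keeping correct local samples is what keeps the batch close to the local policy, as the abstract notes.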

Problem

Research questions and friction points this paper is trying to address.

Federated RLVR

private data

client drift

communication efficiency

cross-client coordination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated RLVR

LoRA

public-data coordination