Revisiting Parameter Server in LLM Post-Training

📅 2026-01-27
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the synchronization bottleneck in Fully Sharded Data Parallel (FSDP) during large language model (LLM) post-training, where variable sequence lengths cause inefficient All-Gather and Reduce-Scatter operations, degrading device utilization. To overcome this, the authors introduce a parameter-server paradigm into FSDP and propose an On-Demand Communication (ODC) mechanism that replaces collective communication with point-to-point exchanges. ODC requires only one synchronization per mini-batch and supports mini-batch-level dynamic load balancing. Implemented on top of PyTorch FSDP with integrated dynamic load distribution and communication scheduling, ODC significantly improves device utilization and training throughput across diverse LLM post-training tasks, achieving up to a 36% speedup over standard FSDP.
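The core argument can be illustrated with a toy cost model (not from the paper; the function names and timings below are illustrative): with per-layer collectives, every device waits for the slowest device at every layer, whereas a single per-minibatch barrier only pays the cost of the slowest device's total work.

```python
# Toy cost model: per-layer barriers (collective FSDP) vs. one
# per-minibatch barrier (ODC-style point-to-point). Illustrative only.

def step_time_per_layer_sync(layer_times):
    """layer_times[d][l] = compute time of layer l on device d.
    A collective per layer stalls every device on the slowest one."""
    num_layers = len(layer_times[0])
    return sum(max(dev[l] for dev in layer_times) for l in range(num_layers))

def step_time_per_minibatch_sync(layer_times):
    """With point-to-point exchanges, devices proceed independently
    and synchronize only once per minibatch."""
    return max(sum(dev) for dev in layer_times)

# Two devices, four layers, with interleaved imbalance
# (e.g. long sequences landing on different devices per layer).
times = [
    [4, 1, 4, 1],
    [1, 4, 1, 4],
]
print(step_time_per_layer_sync(times))      # 16
print(step_time_per_minibatch_sync(times))  # 10
```

Under balanced workloads the two models coincide, which is why collectives dominate in classic pre-training; the gap opens only when per-device work varies, as with post-training sequence lengths.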

📝 Abstract
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at https://github.com/sail-sg/odc.
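The minibatch-level load balancing the abstract mentions can be sketched as a greedy longest-first assignment of variable-length sequences to devices. This is a standard bin-packing heuristic; the function name and the use of sequence length as a cost proxy are assumptions for illustration, not the paper's actual algorithm.

```python
import heapq

def balance_minibatch(seq_lens, num_devices):
    """Assign each sequence to the currently least-loaded device,
    longest sequences first. Load is proxied by total sequence length."""
    heap = [(0, d) for d in range(num_devices)]  # (load, device_id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]
    for length in sorted(seq_lens, reverse=True):
        load, d = heapq.heappop(heap)
        assignment[d].append(length)
        heapq.heappush(heap, (load + length, d))
    return assignment

# Highly variable post-training sequence lengths across 2 devices.
parts = balance_minibatch([4096, 128, 2048, 512, 1024, 256], 2)
print([sum(p) for p in parts])  # per-device loads: [4096, 3968]
```

Because ODC synchronizes once per minibatch rather than once per layer, a coarse assignment like this only has to equalize total work per device over the whole minibatch, which is a much easier target than balancing every layer step.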
Problem

Research questions and friction points this paper is trying to address.

large language model
post-training
imbalanced workload
parameter server
collective communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Demand Communication
Parameter Server
FSDP
Load Imbalance
LLM Post-Training