NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale recommendation model training at the thousand-GPU/NPU scale, embedding lookup and communication latency emerge as the primary performance bottlenecks. This work proposes NestPipe, a framework built on a dual-level nested pipelining mechanism: an inter-batch, staleness-free Dual-Buffer Pipelining (DBP) stage that preserves synchronous training semantics, and an intra-batch Frozen-Window Pipelining (FWP) stage that exploits the embedding freezing phenomenon to improve efficiency. By combining dual-buffer synchronization, coordinated stream scheduling, key-centric sample clustering, and overlap of All2All communication with dense computation, NestPipe achieves up to a 3.06x speedup and 94.07% scaling efficiency on a 1,536-accelerator cluster.
📝 Abstract
Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.
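The abstract's Dual-Buffer Pipelining overlaps the next batch's embedding lookup with the current batch's dense compute while staying staleness-free. The sketch below is only an illustration of that idea under simplifying assumptions, not the paper's implementation: all names (`lookup`, `train_step`, `run_pipeline`) and the scalar "embeddings" are invented for the demo, and a single re-read of conflicting keys stands in for the dual-buffer synchronization that restores synchronous semantics after the overlapped prefetch.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy embedding table: key -> scalar "embedding" (illustrative only).
embedding_table = {k: float(k) for k in range(10)}

def lookup(keys):
    # Stands in for the (slow) embedding lookup stage.
    return {k: embedding_table[k] for k in keys}

def train_step(embeddings):
    # Stands in for dense forward/backward; returns per-key gradients.
    return {k: 0.1 for k in embeddings}

def apply_updates(grads):
    # Plain SGD on the looked-up rows.
    for k, g in grads.items():
        embedding_table[k] -= g

def run_pipeline(batches):
    if not batches:
        return
    updated_last_step = set()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lookup, batches[0])  # fill the first buffer
        for i, keys in enumerate(batches):
            embeddings = future.result()
            # Staleness fix: this batch's prefetch ran concurrently with the
            # previous train step, so re-read any key that step updated.
            for k in set(keys) & updated_last_step:
                embeddings[k] = embedding_table[k]
            if i + 1 < len(batches):
                # Prefetch the next batch's lookup into the second buffer,
                # overlapped with this batch's dense compute.
                future = pool.submit(lookup, batches[i + 1])
            grads = train_step(embeddings)
            apply_updates(grads)
            updated_last_step = set(grads)

run_pipeline([[1, 2], [2, 3], [3, 4]])
```

With overlapping batches, the result matches a fully sequential run: each key's row is decremented once per appearance, with no stale reads.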
Problem

Research questions and friction points this paper is trying to address.

large-scale recommendation
embedding lookup
communication latency
distributed training
training consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

nested pipelining
embedding training
synchronous semantics
communication overlap
large-scale recommendation
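The communication-overlap contribution, hiding All2All exchange latency behind dense computation, can be sketched with a worker thread standing in for a device communication stream. This is a hedged illustration, not the paper's stream scheduler: `all_to_all` is simulated as a transpose across four pretend workers, and every name here is invented for the demo.

```python
from concurrent.futures import ThreadPoolExecutor

WORLD = 4  # pretend four workers

def all_to_all(shards):
    # Simulated All2All: worker dst receives shards[src][dst] from every src.
    return [[shards[src][dst] for src in range(WORLD)] for dst in range(WORLD)]

def dense_compute(x):
    # Stands in for dense layers that do not depend on the exchange.
    return [v * 2 for v in x]

def step(shards, local_dense):
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(all_to_all, shards)   # "communication stream"
        dense_out = dense_compute(local_dense)   # "compute stream", overlapped
        exchanged = comm.result()                # join before dependent ops
    return exchanged, dense_out
```

The dense work runs while the exchange is in flight, and the join happens only where a later operation actually needs the exchanged embeddings.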
Zhida Jiang
JD.com
Zhaolong Xing
JD.com
Huichao Chai
Huawei
Tianxing Sun
JD.com
Qiang Peng
JD.com
Baopeng Yuan
JD.com
Jiaxing Wang
JD.com
Hua Du
JD.com
Zhixin Wu
Shanghai Jiao Tong University
Xuemiao Li
Huawei
Yikui Cao
Huawei
Xinyu Liu
Huawei
Yongxiang Feng
Tsinghua University
Zhen Chen
JD.com
Ke Zhang
JD.com