AI Summary
This work addresses the data redundancy, storage overhead, and I/O bottlenecks that ultra-long user interaction histories cause in recommendation systems under the "Fat Row" paradigm, and in particular the resource inefficiencies that arise in multi-tenant environments. The authors propose a versioned late materialization paradigm that stores interaction histories once in an immutable, normalized layer and reconstructs sequences on the fly during training via lightweight versioned pointers. The approach combines disaggregated preprocessing, pipelined I/O prefetching, and data-affinity optimizations to keep training GPU-compute-bound. It also provides multi-dimensional projection pushdown for heterogeneous model tenants and a bifurcated protocol that prevents future leakage in both streaming and batch training, eliminating redundancy while preserving online-to-offline consistency. Deployed in production, the system reduces training data infrastructure resource usage, enables training with longer sequences, and improves model quality.
Abstract
Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall: data redundancy drives data infrastructure usage beyond GPU training capacity. This redundancy is amplified in multi-tenant environments, where models with vastly different sequence-length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence-length scaling that delivers significant model quality gains. It serves as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
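To make the core idea concrete, here is a minimal Python sketch of late materialization with versioned pointers. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pointer format or storage API, so `UIHStore`, `TrainingExample`, the use of the per-user log length as the "version", and the `max_len` truncation (a stand-in for one dimension of projection pushdown) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp, item_id); hypothetical schema


class UIHStore:
    """Append-only per-user event log, standing in for the normalized tier."""

    def __init__(self) -> None:
        self._events: Dict[str, List[Event]] = {}

    def append(self, user_id: str, timestamp: int, item_id: str) -> int:
        """Append one event and return the new version (assumed: log length)."""
        log = self._events.setdefault(user_id, [])
        if log and timestamp < log[-1][0]:
            raise ValueError("events must be appended in timestamp order")
        log.append((timestamp, item_id))
        return len(log)

    def version(self, user_id: str) -> int:
        """Current version for a user; 0 if the user has no history."""
        return len(self._events.get(user_id, []))

    def materialize(self, user_id: str, version: int, max_len: int) -> List[str]:
        """Rebuild the history visible at `version`, keeping the last `max_len` items.

        Events appended after `version` are never read, which is how this
        sketch avoids leaking future interactions into a training example.
        """
        log = self._events.get(user_id, [])
        end = min(version, len(log))
        start = max(0, end - max_len)
        return [item_id for _, item_id in log[start:end]]


@dataclass(frozen=True)
class TrainingExample:
    """A thin row: it stores a pointer into the history, not the history itself."""
    user_id: str
    uih_version: int
    label_item: str


store = UIHStore()
store.append("u1", 1, "a")
store.append("u1", 2, "b")

# At logging time, record only the version pointer alongside the label.
example = TrainingExample("u1", store.version("u1"), label_item="c")

# Later events are stored once and are invisible to the older example.
store.append("u1", 3, "c")
store.append("u1", 4, "d")

# Two tenants read the same example with different sequence-length budgets.
long_seq = store.materialize(example.user_id, example.uih_version, max_len=10)
short_seq = store.materialize(example.user_id, example.uih_version, max_len=1)
```

In this sketch each event is stored once regardless of how many training examples reference it, and each model tenant chooses its own `max_len` at read time instead of sharing a pre-materialized union row.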