🤖 AI Summary
To address the I/O bottleneck and compromised sample randomness that arise when training datasets vastly exceed main memory capacity in deep learning, this paper proposes a chunk-based, batched random-access memory management framework. The method partitions data into fixed-size chunks and introduces a deterministic, chunk-level random sampling schedule that preserves convergence while eliminating fine-grained read overhead. It further enables cooperative asynchronous prefetching in both single-node and multi-node settings, overcoming the inherent I/O limitations of conventional PyTorch DataLoaders, and its lightweight runtime remains fully backward compatible with existing DataLoader APIs. Experiments demonstrate up to 4.57× end-to-end training speedup, 3.8× higher data loading throughput, significantly reduced GPU idle time, and scalability in large-scale distributed training on up to one thousand GPUs.
📝 Abstract
This paper proposes Brand, a comprehensive memory management system for deep learning training (DLT) where memory capacity is much smaller than the size of the training dataset. Brand starts with a bold design choice: data files are always read from disk in batches, called chunks. Based on this assumption, we propose efficient data access protocols for both the single-node setting and distributed environments with multiple nodes. The protocols minimize wasted data reads due to the larger access granularity and enable efficient inter-node prefetching, while still ensuring the randomness required by DLT. Experimental results indicate that Brand significantly accelerates data fetching in DLT, achieving up to a 4.57x improvement in end-to-end training compared to PyTorch.
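The abstract does not show Brand's actual protocol; as a rough illustration of the idea of chunk-level random sampling, the following sketch partitions sample indices into fixed-size chunks, randomizes the chunk visit order, and shuffles within each chunk, so every disk read can happen at whole-chunk granularity. The function name `chunked_epoch_order` and its parameters are illustrative, not from the paper.

```python
import random

def chunked_epoch_order(num_samples, chunk_size, seed):
    """Yield sample indices for one epoch using chunk-level shuffling.

    All reads occur at whole-chunk granularity: the chunk visit order is
    randomized and samples are shuffled within each chunk, approximating
    global random sampling without fine-grained per-sample reads.
    """
    rng = random.Random(seed)  # deterministic given the seed
    # Partition sample indices into fixed-size chunks (last may be short).
    chunks = [list(range(start, min(start + chunk_size, num_samples)))
              for start in range(0, num_samples, chunk_size)]
    rng.shuffle(chunks)           # randomize chunk visit order
    for chunk in chunks:
        rng.shuffle(chunk)        # randomize samples within the chunk
        yield from chunk

# Each epoch visits every sample exactly once, in a seed-determined order.
order = list(chunked_epoch_order(num_samples=10, chunk_size=4, seed=0))
assert sorted(order) == list(range(10))
```

Because the schedule is a pure function of the seed, every worker in a distributed job can reproduce the same order locally, which is what makes cooperative prefetching of upcoming chunks possible without coordination traffic.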