MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In ML training, heterogeneous data preprocessing leads to imbalanced batch construction, causing GPU idleness and head-of-line blocking. This paper proposes a dynamic batching data loader with a sample-processing-time-aware scheduling strategy: it prioritizes fast-to-preprocess samples to fill batches immediately, while slow samples are prefetched in parallel. Combined with continuous background prefetching and native PyTorch compatibility, it supports single-node, multi-GPU deployment. On a server with four A100 GPUs, it achieves an average 3.6× (up to 7.5×) speedup over PyTorch's DataLoader, raises GPU utilization from 46.4% to 90.45%, and preserves model accuracy. Its core innovation is the first incorporation of fine-grained, sample-level processing-time modeling into dynamic batch construction, effectively mitigating the I/O–computation mismatch.
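The scheduling idea in the summary can be illustrated with a minimal, stdlib-only sketch (not the paper's actual implementation): fast samples are preprocessed inline and fill batches immediately, while slow samples are handed to a background thread and merged into batches as they become ready. The `TimeAwareBatcher` class, the `FAST_THRESHOLD` cutoff, and the `est_time` cost estimator are all illustrative assumptions.

```python
import threading
import queue

FAST_THRESHOLD = 0.01  # assumed cutoff (seconds) separating fast from slow samples

def preprocess(sample):
    # Stand-in for a real transform; per-sample cost varies in practice.
    return sample["data"] * 2

class TimeAwareBatcher:
    """Sketch of sample-time-aware batch construction: fast samples fill
    batches immediately; slow samples are preprocessed by a background
    worker and merged in as they become ready."""

    def __init__(self, samples, batch_size, est_time):
        self.batch_size = batch_size
        self.ready = queue.Queue()  # preprocessed samples ready for batching
        slow = []
        for s in samples:
            if est_time(s) <= FAST_THRESHOLD:
                self.ready.put(preprocess(s))  # fast path: preprocess inline
            else:
                slow.append(s)  # slow path: defer to background worker
        self.total = len(samples)
        worker = threading.Thread(target=self._prefetch, args=(slow,), daemon=True)
        worker.start()

    def _prefetch(self, slow_samples):
        # Background prefetching of expensive samples, in parallel with batching.
        for s in slow_samples:
            self.ready.put(preprocess(s))

    def __iter__(self):
        consumed, batch = 0, []
        while consumed < self.total:
            batch.append(self.ready.get())  # blocks only if nothing is ready yet
            consumed += 1
            if len(batch) == self.batch_size or consumed == self.total:
                yield batch
                batch = []
```

Because the first batches are drawn from the already-preprocessed fast samples, the consumer (the GPU in the real system) is never stalled behind a single slow sample, which is the head-of-line-blocking fix the summary describes.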

📝 Abstract
Data loaders are used by Machine Learning (ML) frameworks like PyTorch and TensorFlow to apply transformations to data before feeding it into the accelerator. This operation is called data preprocessing. Data preprocessing plays an important role in the ML training workflow because, if it is inefficiently pipelined with the training, it can yield high GPU idleness, resulting in significant training delays. Unfortunately, existing data loaders turn out to waste GPU resources, with 76% GPU idleness when using the PyTorch data loader, for example. One key source of inefficiency is the variability in preprocessing time across samples within the same dataset. Existing data loaders are oblivious to this variability, and they construct batches without any consideration of slow or fast samples. In this case, the entire batch is delayed by a single slow sample, stalling the training pipeline and resulting in head-of-line blocking. To address these inefficiencies, we present MinatoLoader, a general-purpose data loader for PyTorch that accelerates training and improves GPU utilization. MinatoLoader is designed for a single-server setup containing multiple GPUs. It continuously prepares data in the background and actively constructs batches by prioritizing fast-to-preprocess samples, while slower samples are processed in parallel. We evaluate MinatoLoader on servers with V100 and A100 GPUs. On a machine with four A100 GPUs, MinatoLoader improves the training time of a wide range of workloads by up to 7.5× (3.6× on average) over PyTorch DataLoader and Pecan, and up to 3× (2.2× on average) over DALI. It also increases average GPU utilization from 46.4% with PyTorch to 90.45%, while preserving model accuracy and enabling faster convergence.
Problem

Research questions and friction points this paper is trying to address.

Addresses GPU idleness from inefficient data preprocessing
Solves head-of-line blocking caused by variable sample processing times
Improves training speed and GPU utilization in ML frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes fast-to-preprocess samples dynamically
Processes slow samples in parallel in the background
Reduces GPU idleness through intelligent batching
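The bullets above assume the loader can tell fast samples from slow ones. One simple way to obtain such per-sample cost estimates is to time each transform online; the helper below is an illustrative sketch (the function names and the threshold are assumptions, not the paper's API):

```python
import time

FAST_THRESHOLD = 0.01  # illustrative cutoff in seconds, not a value from the paper

def time_transform(transform, sample):
    """Run one preprocessing transform and record its wall-clock cost."""
    start = time.perf_counter()
    out = transform(sample)
    elapsed = time.perf_counter() - start
    return out, elapsed

def classify(transform, sample):
    """Label a sample fast or slow based on its measured preprocessing time."""
    out, elapsed = time_transform(transform, sample)
    return out, ("fast" if elapsed <= FAST_THRESHOLD else "slow")
```

Measured labels like these could then drive the prioritization step: fast-labeled samples go straight into the next batch, slow-labeled ones are queued for background processing.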
Rahma Nouaji
McGill University
Stella Bitchebe
McGill University
Oana Balmau
Assistant Professor, McGill University, School of Computer Science
Computer Systems · Storage · Operating Systems
Ricardo Macedo
INESC TEC & University of Minho