🤖 AI Summary
This work addresses the high overhead and latency in machine learning training caused by numerous independent GET requests during data loading. The authors propose GetBatch, the first system to treat batched object retrieval as a first-class primitive in storage systems. By replacing multiple independent requests with a single, deterministic, fault-tolerant streaming execution, GetBatch enables efficient and low-latency multi-object fetching. The approach integrates distributed object storage, request aggregation, and streaming transfer, achieving up to 15× higher throughput for small objects, a 2× reduction in P95 batch latency in production environments, and a 3.7× reduction in P99 tail latency.
📄 Abstract
Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15× throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2× and P99 per-object tail latency by 3.7× compared to individual GET requests.