🤖 AI Summary
This work addresses the high overhead and latency in machine learning training caused by numerous independent GET requests during data loading. The authors propose GetBatch, the first system to treat batched object retrieval as a first-class primitive in storage systems. By replacing multiple independent requests with a single, deterministic, fault-tolerant streaming execution, GetBatch enables efficient and low-latency multi-object fetching. The approach integrates distributed object storage, request aggregation, and streaming transfer, achieving up to 15× higher throughput for small objects, a 2× reduction in P95 batch latency in production environments, and a 3.7× reduction in P99 tail latency.
📄 Abstract
Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15× throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2× and P99 per-object tail latency by 3.7× compared to individual GET requests.