DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers

📅 2026-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiencies in existing checkpointing schemes for large-scale Transformer training, which overlook the three-dimensional heterogeneity of state data—spanning memory placement, logical sharding, and serialization requirements—leading to device-host transfer bottlenecks, inefficient serialization, and I/O contention. To resolve this, the authors propose DataStates-LLM, a novel architecture that introduces, for the first time, a composable state provider mechanism to explicitly model state heterogeneity and decouple state abstraction from data movement. By leveraging parameter invariance during forward and backward passes, it enables non-blocking asynchronous snapshots and co-optimizes metadata serialization with tensor I/O. Evaluated on a 256×A100-40GB system training a 70B-parameter model, the approach achieves up to a 4× improvement in checkpoint throughput and reduces end-to-end training time by as much as 2.2×.

📝 Abstract
The rapid growth of large Transformer-based models, specifically Large Language Models (LLMs), now scaling to trillions of parameters, has necessitated training across thousands of GPUs using complex hybrid parallelism strategies (e.g., data, tensor, and pipeline parallelism). Checkpointing this massive, distributed state is critical for a wide range of use cases, such as resilience, suspend-resume, investigating undesirable training trajectories, and explaining model evolution. However, existing checkpointing solutions typically treat model state as opaque binary blobs, ignoring the "3D heterogeneity" of the underlying data structures, which vary by memory location (GPU vs. host), number of "logical" objects sharded and split across multiple files, data types (tensors vs. Python objects), and their serialization requirements. This results in significant runtime overheads due to blocking device-to-host transfers, data-oblivious serialization, and storage I/O contention. In this paper, we introduce DataStates-LLM, a novel checkpointing architecture that leverages State Providers to decouple state abstraction from data movement. DataStates-LLM exploits the immutability of model parameters during the forward and backward passes to perform "lazy", non-blocking asynchronous snapshots. By introducing State Providers, we efficiently coalesce fragmented, heterogeneous shards and overlap the serialization of metadata with bulk tensor I/O. We evaluate DataStates-LLM on models up to 70B parameters on 256 A100-40GB GPUs. Our results demonstrate that DataStates-LLM achieves up to 4× higher checkpointing throughput and reduces end-to-end training time by up to 2.2× compared to state-of-the-art solutions, effectively mitigating the serialization and heterogeneity bottlenecks in extreme-scale LLM training.
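The core mechanism described above, state providers that expose heterogeneous shards, a non-blocking snapshot taken by a background thread while parameters are immutable, and metadata serialization coalesced with bulk tensor I/O, can be illustrated with a minimal sketch. This is not the paper's actual API: the names `StateProvider` and `AsyncCheckpointer` are hypothetical, and raw bytes stand in for GPU tensor shards.

```python
import io
import json
import threading

class StateProvider:
    """Hypothetical provider exposing one shard of heterogeneous state:
    bulk payload (stand-in for tensor data) plus small Python-object metadata."""
    def __init__(self, name: str, payload: bytes, meta: dict):
        self.name = name
        self.payload = payload
        self.meta = meta

class AsyncCheckpointer:
    """Sketch of a 'lazy', non-blocking snapshot: serialization runs in a
    background thread while the training loop continues, relying on the
    parameters being immutable during forward/backward passes."""
    def __init__(self):
        self._thread = None

    def snapshot(self, providers, sink: io.BytesIO):
        def _write():
            # Coalesce fragmented shards into one contiguous stream,
            # recording offsets so metadata can be written as a footer
            # (overlapping metadata serialization with bulk I/O).
            offsets = {}
            for p in providers:
                offsets[p.name] = (sink.tell(), len(p.payload))
                sink.write(p.payload)
            footer = {name: {"offset": off, "length": length,
                             "meta": next(p.meta for p in providers
                                          if p.name == name)}
                      for name, (off, length) in offsets.items()}
            sink.write(json.dumps(footer).encode())

        # Returns immediately; the caller proceeds to the next iteration.
        self._thread = threading.Thread(target=_write)
        self._thread.start()

    def wait(self):
        if self._thread is not None:
            self._thread.join()
```

A caller would build one provider per shard, call `snapshot(...)`, keep training, and `wait()` only before the next snapshot or at shutdown. In the real system the payload copy would come from pinned host buffers fed by asynchronous device-to-host transfers; this sketch only shows the control flow.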
Problem

Research questions and friction points this paper is trying to address.

checkpointing
large language models
distributed training
state heterogeneity
serialization overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

State Providers
asynchronous checkpointing
heterogeneous state management
scalable LLM training
lazy snapshot