🤖 AI Summary
To address the poor modularity and high scaling cost of large-model training systems on heterogeneous hardware, this paper proposes a modular training system architecture characterized by high cohesion and low coupling. The approach strictly encapsulates component interfaces, introduces a Lines-of-Code (LoC)-based metric for quantifying module complexity, and integrates heterogeneous-resource-aware scheduling, enabling constant-complexity system growth (e.g., integrating RoPE across hundreds of modules requires only ~10 lines of code). Experiments demonstrate that the system matches the training performance of mainstream frameworks while significantly improving modularity, accelerating feature integration by an order of magnitude, and substantially reducing maintenance overhead. This work establishes an evolution-friendly engineering paradigm for large-scale deep learning systems.
📝 Abstract
We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to the linear or quadratic complexity of other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
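The constant LoC-complexity claim rests on strict encapsulation: call sites depend only on a component's configured interface, so swapping in a new implementation (such as RoPE) is a local config change rather than edits scattered across modules. The toy sketch below illustrates this pattern in plain Python; it is not AXLearn's actual API, and the names (`AttentionConfig`, `run_attention`, `toy_rope`) are invented for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List

# Toy config-driven component: the layer reads its positional-encoding
# behavior from its config, so call sites never name a concrete
# implementation. (Illustrative only; not AXLearn's real config system.)
@dataclass(frozen=True)
class AttentionConfig:
    dim: int = 64
    # Hook for a positional transform; defaults to the identity.
    pos_embed: Callable[[List[int]], List[int]] = field(
        default=lambda tokens: tokens
    )

def toy_rope(tokens: List[int]) -> List[int]:
    # Stand-in for RoPE: reversing the sequence is just a visible
    # transform so the swap is observable in the output.
    return list(reversed(tokens))

def run_attention(cfg: AttentionConfig, tokens: List[int]) -> List[int]:
    # The layer consults only its config object, so integrating a new
    # pos_embed never requires editing this (or any other) call site.
    return cfg.pos_embed(tokens)

base = AttentionConfig()
with_rope = replace(base, pos_embed=toy_rope)  # the "one-line" integration

print(run_attention(base, [1, 2, 3]))       # → [1, 2, 3]
print(run_attention(with_rope, [1, 2, 3]))  # → [3, 2, 1]
```

The point of the sketch is the last two lines: both call sites are identical, and only the config construction changed, which is the mechanism behind integrating a feature with a handful of lines regardless of how many modules consume it.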