Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training

📅 2025-09-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Scaling large language models (LLMs) drives GPU memory demand beyond HBM capacity; existing memory-swapping optimizations assume static computation graphs and therefore break in Eager Mode, where operator sequences change dynamically. Method: We propose Chameleon, the first adaptive memory-swapping framework designed specifically for Eager Mode and its dynamic operator sequences. Its core components are a lightweight online profiler, a dynamic swap-policy generation algorithm, and an optimized policy-execution module, which together enable real-time memory awareness and scheduling without static graph analysis. Contribution/Results: Experiments show Chameleon reduces profiling overhead by 84.25%, enables training models up to 4x larger than hardware memory, and improves throughput by up to 38.94% over recomputation- and high-degree-parallelism baselines, significantly improving the trainability of large models in resource-constrained settings.
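The swap primitive the summary builds on can be shown concretely. Below is a minimal sketch, not Chameleon's implementation, of Eager Mode activation swapping using PyTorch's public `torch.autograd.graph.saved_tensors_hooks` API; the `pack_to_cpu`/`unpack_to_gpu` helper names are invented for illustration.

```python
import torch

def pack_to_cpu(t):
    # Evict the saved activation to pinned host memory (async D2H copy).
    buf = torch.empty(t.size(), dtype=t.dtype, pin_memory=True)
    buf.copy_(t, non_blocking=True)
    return (t.device, buf)

def unpack_to_gpu(packed):
    # Fetch the activation back to its original device for backward.
    device, buf = packed
    return buf.to(device, non_blocking=True)

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# Every tensor autograd saves for backward is routed through the hooks.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()
loss.backward()  # activations stream back from CPU on demand
```

PyTorch ships a ready-made variant of this pattern, `torch.autograd.graph.save_on_cpu(pin_memory=True)`. Per the abstract, Chameleon's contribution sits above this layer: deciding which tensors to swap and when, against an operator sequence that can change every iteration.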

📝 Abstract
The increasing size of large language models (LLMs) has led to a surge in memory requirements during training, often exceeding the capacity of high-bandwidth memory (HBM). Swap-based memory optimization incurs neither accuracy loss nor additional end-to-end overhead when effectively overlapped, making it an attractive solution. However, existing swap methods assume consistent operator sequences, which is impractical in Eager Mode, where operator sequences can vary during training. We propose Chameleon, which redesigns the end-to-end process of swap-based memory optimization and is the first work to consider varying operator sequences in Eager Mode. Chameleon (i) introduces a lightweight online profiler that continuously monitors operator sequences, (ii) generates effective swap policies from limited operator information, and (iii) optimizes the policy execution module for accurate policy application and better performance. Experimental results demonstrate that Chameleon reduces profiling overhead by 84.25%, enables training models up to 4x larger than hardware memory while adapting to changes in operator sequences, and improves performance by up to 38.94% compared to recomputation or high-degree parallelism.
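The abstract's claim that swapping adds no end-to-end overhead "when effectively overlapped" comes down to stream scheduling. Below is a minimal sketch, assuming a CUDA device, in which device-to-host copies run on a dedicated side stream while the default stream keeps computing; the `evict_async` and `fetch` helpers are illustrative names, not the paper's API.

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for swap traffic

def evict_async(t):
    """Begin copying t to pinned host memory without stalling compute."""
    copy_stream.wait_stream(torch.cuda.current_stream())  # t must be ready
    buf = torch.empty(t.size(), dtype=t.dtype, pin_memory=True)
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        buf.copy_(t, non_blocking=True)
        done.record()
    t.record_stream(copy_stream)  # keep t's memory alive until the copy ends
    return buf, done

def fetch(buf, done, device="cuda"):
    """Block only at the point where the swapped tensor is needed again."""
    done.synchronize()
    return buf.to(device, non_blocking=True)

a = torch.randn(1024, 1024, device="cuda")
buf, done = evict_async(a)   # D2H copy starts on the side stream...
del a                        # ...so the GPU copy can be reclaimed early
x = torch.randn(1024, 1024, device="cuda")
y = x @ x                    # ...and the copy overlaps this matmul
a_again = fetch(buf, done)   # prefetch back on demand
```

The `wait_stream` and `record_stream` calls are the subtle part: they order the copy after the tensor's producer and stop the caching allocator from reusing its memory mid-copy, which is exactly what makes overlap safe rather than merely fast.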
Problem

Research questions and friction points this paper is trying to address.

Addresses memory-intensive LLM training exceeding HBM capacity
Handles inconsistent operator sequences in Eager Mode training
Reduces swap profiling overhead while enabling larger models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight online profiler for continuous operator monitoring (sketched in code after this list)
Generates swap policies with limited operator information
Optimizes policy execution for accurate application
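A hedged sketch of the online-profiling idea from the first item above, assuming nothing about Chameleon's internals: log the operator sequence with module forward hooks, hash it at iteration end, and flag divergence so the swap policy can be regenerated. `SequenceMonitor` and its methods are invented names for illustration.

```python
import hashlib

import torch

class SequenceMonitor:
    """Record the per-iteration operator sequence and detect changes."""

    def __init__(self, model):
        self.ops = []
        self.last_hash = None
        for name, mod in model.named_modules():
            mod.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self.ops.append(name)  # log execution order as it happens
        return hook

    def end_iteration(self):
        """Hash this iteration's sequence; True means it diverged."""
        digest = hashlib.md5("|".join(self.ops).encode()).hexdigest()
        changed = digest != self.last_hash  # first iteration always True
        self.last_hash, self.ops = digest, []
        return changed

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
monitor = SequenceMonitor(model)
_ = model(torch.randn(2, 16))
if monitor.end_iteration():
    print("operator sequence changed; regenerate the swap policy")
```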
👥 Authors
Zibo Wang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yuhang Zhou
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Zhibin Wang
Zhejiang University
new particle formation, aerosols, hygroscopicity, black carbon
Shipeng Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Xinjing Huang
Huawei Technologies Co., Ltd, Shenzhen, China
Chendong Cai
Huawei Technologies Co., Ltd, Shenzhen, China
Bingxu Mu
Huawei Technologies Co., Ltd, Shenzhen, China
Yuqing Sun
Huawei Technologies Co., Ltd, Shenzhen, China
Zhiheng Hu
Huawei Technologies Co., Ltd, Shenzhen, China
Bin She
Huawei Technologies Co., Ltd, Shenzhen, China
Shu You
Huawei Technologies Co., Ltd, Shenzhen, China
Guanghuan Fang
Huawei Technologies Co., Ltd, Shenzhen, China
Rong Gu
Mälardalen University
Formal Methods, Machine Learning, Autonomous Systems
Wanchun Dou
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Guihai Chen
Professor of Computer Science, Computer Science and Technology
Chen Tian
Professor, Nanjing University
Data Center Networking, Network Function Virtualisation, Content Distribution