HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck faced by large language models (LLMs) on SuperNode architectures, where model memory demands far exceed the capacity of a single device's high-bandwidth memory (HBM). Existing runtime offloading approaches lack global scheduling awareness, leading to exposed communication latency and pipeline stalls. To overcome this, the authors propose HyperOffload, a framework that introduces explicit cache operators in the compiler's intermediate representation (IR) to model remote memory accesses. By performing global static analysis of tensor lifetimes and execution dependencies, HyperOffload enables proactive memory scheduling tailored to the hierarchical structure of SuperNodes. Implemented as an extension to MindSpore with a dedicated compiler pass and a remote memory backend, the approach reduces peak device memory consumption by up to 26% on representative LLM inference tasks while maintaining end-to-end performance.
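To make the cache-operator idea concrete, here is a minimal sketch of what "explicit data movement in the IR" could look like: offload and fetch become graph nodes that a scheduler can see and reorder, inserted when live tensors exceed the HBM budget. All names (`Node`, `insert_cache_ops`, the greedy eviction policy) are hypothetical illustrations, not the paper's MindSpore implementation.

```python
from dataclasses import dataclass, field

# Toy IR: each node is either a compute op or an explicit cache operator.
# "offload" pushes a tensor to the shared memory pool; "fetch" brings it back.
# (Hypothetical operator names; the paper's MindSpore IR will differ.)

@dataclass
class Node:
    name: str
    kind: str                      # "compute", "offload", or "fetch"
    inputs: list = field(default_factory=list)

def insert_cache_ops(nodes, hbm_budget, sizes):
    """Greedily insert offload/fetch nodes so live tensors fit the HBM budget."""
    out, resident = [], {}         # resident: tensor name -> "hbm" or "pool"
    for node in nodes:
        # Fetch any input that was previously offloaded to the pool.
        for t in node.inputs:
            if resident.get(t) == "pool":
                out.append(Node(f"fetch_{t}", "fetch", [t]))
                resident[t] = "hbm"
        out.append(node)
        resident[node.name] = "hbm"
        # Evict tensors not needed by this node while over budget.
        while sum(sizes[t] for t, loc in resident.items() if loc == "hbm") > hbm_budget:
            victims = [t for t, loc in resident.items()
                       if loc == "hbm" and t not in node.inputs and t != node.name]
            if not victims:        # everything live is needed right now
                break
            out.append(Node(f"offload_{victims[0]}", "offload", [victims[0]]))
            resident[victims[0]] = "pool"
    return out
```

Because offloads and fetches are ordinary graph nodes here, a later pass can analyze and move them just like compute ops, which is the visibility a purely runtime swapper lacks.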

📝 Abstract
The rapid evolution of Large Language Models (LLMs) towards long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of individual device HBM. While emerging SuperNode architectures offer terabyte-scale shared memory pools via high-bandwidth interconnects, existing software stacks fail to exploit this hardware effectively. Current runtime-based offloading and swapping techniques operate with a local view, leading to reactive scheduling and exposed communication latency that stall the computation pipeline. In this paper, we propose the SuperNode Memory Management Framework (HyperOffload), a compiler-assisted approach that leverages graph-driven memory management to treat remote memory access as explicit operations in the computation graph, designed for hierarchical SuperNode architectures. Unlike reactive runtime systems, HyperOffload represents data movement using cache operators within the compiler's Intermediate Representation (IR). This design enables a global, compile-time analysis of tensor lifetimes and execution dependencies. Leveraging this visibility, we develop a global execution-order refinement algorithm that statically schedules data transfers to hide remote memory latency behind compute-intensive regions. We implement HyperOffload within the production deep learning framework MindSpore, adding a remote memory backend and specialized compiler passes. Evaluation on representative LLM workloads shows that HyperOffload reduces peak device memory usage by up to 26% for inference while maintaining end-to-end performance. Our work demonstrates that integrating memory-augmented hardware into the compiler's optimization framework is essential for scaling next-generation AI workloads.
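The abstract's execution-order refinement, hiding remote-memory latency behind compute, can be illustrated with a toy static scheduler: each fetch is hoisted far enough ahead of its consumer that the preceding compute time covers the transfer latency. The function name, the linear-schedule model, and the assumption that a fetch immediately precedes its consumer are all illustrative simplifications, not the paper's algorithm.

```python
def hoist_prefetches(schedule, durations, fetch_latency):
    """Hoist each 'fetch_*' op early enough that at least `fetch_latency`
    worth of compute runs between the fetch and its consumer.

    schedule:   list of op names; in this toy model each fetch op is
                immediately followed by the op that consumes it.
    durations:  compute-op name -> execution time.
    """
    result = [op for op in schedule if not op.startswith("fetch_")]
    for i, op in enumerate(schedule):
        if not op.startswith("fetch_"):
            continue
        consumer = schedule[i + 1]          # op needing the fetched tensor
        pos = result.index(consumer)
        covered = 0.0
        # Walk backwards until enough compute time precedes the fetch.
        while pos > 0 and covered < fetch_latency:
            pos -= 1
            covered += durations.get(result[pos], 0.0)
        result.insert(pos, op)
    return result
```

With three 10-unit matmuls and a 15-unit fetch latency, the fetch for the last op is hoisted to the front of the schedule, so the transfer fully overlaps the first two compute ops instead of stalling the pipeline.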
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
SuperNode Architectures
Memory Management
Remote Memory Access
Compiler Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-driven memory management
compiler-assisted offloading
hierarchical memory
SuperNode architecture
static scheduling
👥 Authors
Fangxin Liu — Shanghai Jiao Tong University — In-memory Computing, Brain-inspired Neuromorphic Computing
Qinghua Zhang — Huawei Technologies Co., Ltd., China
Hanjing Shen — Shanghai Jiao Tong University, Shanghai, China
Zhibo Liang — Huawei Technologies Co., Ltd., China
Li Jiang — Shanghai Jiao Tong University — Computer Architecture with Emerging Technology and Application, Machine Learning for System Reliability and Optimization
Haibing Guan — Shanghai Jiao Tong University, Shanghai, China
Chong Bao — Zhejiang University — Computer Vision, Augmented Reality
Xuefeng Jin — Huawei Technologies Co., Ltd., China