🤖 AI Summary
This work investigates the energy-efficiency impact of offloading expert weights to SSDs in MoE-based large language model inference. SSDs offer high capacity at low cost but incur significantly higher read energy than HBM; to quantify this trade-off, we propose the first fine-grained energy model and system-level analysis framework designed specifically for MoE weight offloading, and use it to evaluate multi-tier storage hierarchies (HBM/DDR/SSD) on models including DeepSeek-R1. Results show that current SSD offloading increases per-token generation energy by up to 12×, making weight reads the dominant contributor to total inference energy; prefetching hides access latency but cannot alleviate the inherent energy overhead. Crucially, we establish that SSD read energy efficiency must improve by roughly an order of magnitude for offloading to become energy-viable, revealing high-efficiency storage access as a critical bottleneck for the green deployment of MoE models.
📝 Abstract
Large Language Models (LLMs) employing Mixture-of-Experts (MoE) architectures scale to trillions of parameters but require vast memory, motivating a line of research that offloads expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs. While SSDs provide cost-effective capacity, their read energy per bit is substantially higher than that of DRAM. This paper quantitatively analyzes the energy implications of offloading MoE expert weights to SSDs during the critical decode stage of LLM inference. Our analysis, comparing SSD, CPU memory (DDR), and HBM storage scenarios for models such as DeepSeek-R1, reveals that offloading MoE weights to current SSDs drastically increases per-token generation energy (e.g., by up to ~12x over the HBM baseline), making weight reads dominate the total inference energy budget. Although techniques such as prefetching effectively hide access latency, they cannot mitigate this fundamental energy penalty. We further explore future technological scaling, finding that the inherent sparsity of MoE models could make SSDs energy-viable if Flash read energy improves significantly, roughly by an order of magnitude.
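The core arithmetic behind the ~12x claim can be sketched as a first-order model: per-token decode energy for weight access is roughly (active expert bytes read per token) × (read energy per bit of the storage tier). The sketch below is illustrative only; the byte count and all pJ/bit figures are placeholder assumptions chosen for demonstration, not the paper's measured or modeled values.

```python
# First-order per-token decode energy model for MoE weight reads.
# ALL numbers are illustrative placeholder assumptions, not values
# taken from the paper or from any datasheet.

ACTIVE_PARAM_BYTES = 37e9  # assumed bytes of active expert weights read per token

# Assumed read energy per bit for each storage tier, in pJ/bit (illustrative).
READ_ENERGY_PJ_PER_BIT = {
    "HBM": 4.0,
    "DDR": 20.0,
    "SSD": 48.0,
}

def per_token_read_energy_joules(tier: str,
                                 bytes_read: float = ACTIVE_PARAM_BYTES) -> float:
    """Energy in joules to read `bytes_read` of expert weights from `tier`."""
    pj_per_bit = READ_ENERGY_PJ_PER_BIT[tier]
    # bytes -> bits (x8), then pJ -> J (x1e-12)
    return bytes_read * 8 * pj_per_bit * 1e-12

if __name__ == "__main__":
    baseline = per_token_read_energy_joules("HBM")
    for tier in READ_ENERGY_PJ_PER_BIT:
        e = per_token_read_energy_joules(tier)
        print(f"{tier}: {e:.2f} J/token ({e / baseline:.1f}x HBM)")
```

With these placeholder inputs the SSD tier lands at 12x the HBM baseline, mirroring the paper's headline result; the model also makes clear why prefetching cannot help, since it changes *when* the bits are read, not *how many* bits are read or the per-bit cost.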