ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high activation memory overhead and limited throughput in large language model pretraining, where existing low-rank or sparse approaches struggle to balance efficiency and performance. The authors propose ELAS, a novel framework that, for the first time, integrates squared ReLU activations into low-rank feedforward networks and applies 2:4 structured activation sparsity—rather than conventional weight sparsity—to their outputs. By leveraging modern GPUs’ native support for structured sparsity, ELAS achieves nearly lossless performance across LLaMA models ranging from 60M to 1B parameters, substantially reduces activation memory consumption, and enhances both training and inference efficiency, particularly in large-batch settings.
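To make the memory claim concrete, here is a rough NumPy sketch of how a 2:4 sparse tensor can be stored in compressed form: two surviving values per group of four, plus a 2-bit position index for each. This reflects the general 2:4 layout that sparse-capable GPUs use, not necessarily how ELAS stores activations. The function names `compress_2_4`, `decompress_2_4`, and `compressed_fraction` are hypothetical and chosen only for illustration.

```python
import numpy as np

def compress_2_4(a):
    """Split a 2:4 sparse array into kept values and their in-group positions.

    Assumes the last dimension is a multiple of 4 and that each group of 4
    already has at most 2 nonzeros (illustrative helper, not a library API).
    """
    groups = a.reshape(-1, 4)
    # Positions of the 2 largest-magnitude entries per group, in ascending order.
    top2 = np.argsort(-np.abs(groups), axis=1, kind="stable")[:, :2]
    keep = np.sort(top2, axis=1)
    values = np.take_along_axis(groups, keep, axis=1)
    # Each position fits in 2 bits; uint8 is used here only for simplicity.
    return values, keep.astype(np.uint8)

def decompress_2_4(values, idx, shape):
    """Rebuild the dense array from kept values and positions."""
    groups = np.zeros((values.shape[0], 4), dtype=values.dtype)
    np.put_along_axis(groups, idx.astype(np.intp), values, axis=1)
    return groups.reshape(shape)

def compressed_fraction(value_bits=16, index_bits=2):
    """Compressed size relative to dense, per group of 4 elements."""
    return (2 * value_bits + 2 * index_bits) / (4 * value_bits)

a = np.array([[0.0, 5.0, 0.0, -1.0, 2.0, 0.0, 0.0, 3.0]])
values, idx = compress_2_4(a)
restored = decompress_2_4(values, idx, a.shape)
```

Under these assumptions, a 16-bit activation tensor would take a little over half its dense size, since each group of four stores two values and two 2-bit indices.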

📝 Abstract

Large Language Models (LLMs) have achieved remarkable capabilities, but the immense computational demands of training them remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years because it can significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations, a format that NVIDIA GPUs support natively, has become a promising direction. However, existing low-rank methods often leave activation matrices full-rank; these activations dominate memory consumption and limit throughput during large-batch training. Furthermore, directly applying sparsity to weights often leads to non-negligible performance degradation. To achieve efficient pre-training of LLMs, this paper proposes ELAS (Efficient pre-training of Low-rank LLMs via 2:4 Activation Sparsity), a novel framework that applies 2:4 activation sparsity to low-rank models. ELAS uses squared ReLU activation functions in the feed-forward networks of low-rank models and applies 2:4 structured sparsity to the activations after the squared ReLU operation. We evaluate ELAS through pre-training experiments on LLaMA models ranging from 60M to 1B parameters. The results show that ELAS incurs minimal performance degradation after applying 2:4 activation sparsity while accelerating both training and inference. Moreover, ELAS reduces activation memory overhead, particularly with large batch sizes. Code is available at ELAS Repo.
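The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a plain two-layer (non-gated) feed-forward block, assumes each weight matrix is factored into two thin matrices, and assumes the 2:4 pattern is chosen by keeping the two largest-magnitude entries in every group of four. The names `squared_relu`, `prune_2_4`, and `low_rank_ffn` are hypothetical. The masking is emulated densely, so it shows the arithmetic only and gives no hardware speedup.

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def prune_2_4(a):
    """Zero the 2 smallest-magnitude entries in each group of 4 (last axis)."""
    if a.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = a.reshape(-1, 4)
    # Indices of the two smallest magnitudes in every group.
    drop = np.argsort(np.abs(groups), axis=1, kind="stable")[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(a.shape)

def low_rank_ffn(x, a_up, b_up, a_down, b_down):
    """Feed-forward block with factored weights (assumed layout)."""
    h = (x @ a_up) @ b_up            # low-rank up-projection: d_model -> d_ff
    h = prune_2_4(squared_relu(h))   # 2:4 sparsity applied after squared ReLU
    return (h @ a_down) @ b_down     # low-rank down-projection: d_ff -> d_model

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff, rank = 8, 16, 2
x = rng.standard_normal((3, d_model))
a_up = rng.standard_normal((d_model, rank))
b_up = rng.standard_normal((rank, d_ff))
a_down = rng.standard_normal((d_ff, rank))
b_down = rng.standard_normal((rank, d_model))

y = low_rank_ffn(x, a_up, b_up, a_down, b_down)
hidden = prune_2_4(squared_relu((x @ a_up) @ b_up))
```

Squared ReLU already zeroes every negative pre-activation, so many groups of four need little or no extra pruning to meet the 2:4 pattern. That is one intuition for why sparsifying activations at this point could cost little accuracy.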

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Low-Rank Training

Activation Sparsity

2:4 Structured Sparsity

Memory Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Training

2:4 Structured Sparsity

Activation Sparsity