ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the substantial performance degradation caused by 2:4 semi-structured pruning in large language model (LLM) deployment, this paper proposes a high-performance pruning method based on adaptive matrix factorization. The core innovation decomposes each weight matrix into a 2:4-sparse core wrapped by block-diagonal transformation matrices that act as lightweight pre- and post-correction modules to preserve representational capacity; theoretical analysis proves the optimization converges to a proxy loss no worse than existing approaches. The method employs block coordinate descent coupled with layer-wise surrogate-loss minimization, enabling efficient one-shot post-training pruning. Experiments on the Llama and Qwen model families demonstrate that, at the same inference speedup and memory savings as state-of-the-art 2:4 pruning methods, the approach improves downstream task accuracy by up to 4.2% and reduces perplexity by up to 18.7%, significantly alleviating the accuracy–efficiency trade-off.
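To make the 2:4 constraint mentioned above concrete, here is a minimal numpy sketch of a magnitude-based 2:4 mask: in every contiguous group of four weights, only two may be nonzero. This is the sparsity pattern ARMOR's sparse core must satisfy; the function name and the keep-largest-magnitude criterion are illustrative assumptions, not the paper's actual selection rule.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """In every contiguous group of 4 weights along a row, keep the
    2 largest-magnitude entries and zero the other 2 (2:4 sparsity)."""
    rows, cols = w.shape
    assert cols % 4 == 0, "2:4 sparsity needs the column count divisible by 4"
    groups = w.reshape(rows, cols // 4, 4)
    # indices of the two smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.array([[1.0, -4.0, 2.0, 0.5, 3.0, -1.0, -5.0, 0.2]])
print(prune_2_4(w))  # exactly half the entries survive in each group of 4
```

The hardware benefit comes from this fixed pattern: NVIDIA sparse tensor cores can skip the zeroed half of each group, which is why 2:4 (unlike unstructured sparsity) yields real inference speedups.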

📝 Abstract
Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block-diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and memory requirements of large language models
Minimizing performance degradation in semi-structured pruning methods
Improving trade-off between model compression and task accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

ARMOR factorizes weight matrices into sparse core with wrappers
Uses block coordinate descent to minimize layer-wise proxy loss
Achieves 2:4 sparsity while preserving model quality and speed
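The three bullets above can be sketched end-to-end as a toy block-coordinate-descent loop. This is a hedged illustration, not ARMOR itself: it uses diagonal wrappers (block size 1) rather than general block-diagonal ones, fixes the 2:4 support up front by magnitude, and alternates exact least-squares updates of the two wrappers against a layer-wise proxy loss ||XW - X diag(a) S diag(b)||². All variable names are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_2_4(w):
    """Keep the 2 largest-magnitude weights in every group of 4 along each row."""
    r, c = w.shape
    g = w.reshape(r, c // 4, 4)
    drop = np.argsort(np.abs(g), axis=-1)[..., :2]
    mask = np.ones_like(g, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (g * mask).reshape(r, c)

def proxy_loss(x, w, w_hat):
    """Layer-wise proxy: activation reconstruction error on calibration data."""
    return np.linalg.norm(x @ w - x @ w_hat) ** 2

# toy layer (d_in x d_out) with a small calibration batch
d_in, d_out, n = 8, 8, 64
w = rng.standard_normal((d_in, d_out))
x = rng.standard_normal((n, d_in))

s = prune_2_4(w)        # 2:4 sparse core (support fixed for this toy)
a = np.ones(d_in)       # diagonal "pre" wrapper
b = np.ones(d_out)      # diagonal "post" wrapper
t = x @ w               # target activations of the dense layer

losses = [proxy_loss(x, w, a[:, None] * s * b[None, :])]
for _ in range(10):
    # --- a-step: X diag(a) (S diag(b)) is linear in a; solve the normal equations
    m = s * b[None, :]
    gram = (x.T @ x) * (m @ m.T)
    rhs = np.diag(x.T @ t @ m.T)
    a = np.linalg.solve(gram + 1e-8 * np.eye(d_in), rhs)
    # --- b-step: each output column scales independently (1-D least squares)
    nmat = (x * a[None, :]) @ s
    b = (nmat * t).sum(axis=0) / ((nmat ** 2).sum(axis=0) + 1e-12)
    losses.append(proxy_loss(x, w, a[:, None] * s * b[None, :]))

print(losses[0], losses[-1])  # each coordinate step cannot increase the loss
```

Because each sub-step exactly minimizes the proxy loss over one variable with the others held fixed, the loss sequence is non-increasing; this is the same mechanism behind the paper's convergence guarantee, although ARMOR's actual updates (block-diagonal wrappers and joint sparse-core selection) are richer than this sketch.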
Lawrence Liu
M.S. EE 2026, UCLA
Applied Machine Learning · Optimization · Reinforcement Learning
Alexander Liu
University of California, Los Angeles
Mengdi Wang
Princeton University
Tuo Zhao
Georgia Institute of Technology
Lin F. Yang
University of California, Los Angeles