ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

📅 2026-03-28
📈 Citations: 0
Influential: 0
📝 Abstract
The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
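To make the abstract's terminology concrete, here is a minimal, hypothetical sketch (not ENEC's actual implementation) of block-based fixed-length encoding: each block is packed at the bit-width of its widest value, and a branch-free zigzag map folds signed integers into unsigned ones first. All function names are illustrative.

```python
def zigzag(x: int) -> int:
    """Branch-free map of signed to unsigned ints: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (x << 1) ^ (x >> 63)

def unzigzag(u: int) -> int:
    """Inverse of zigzag."""
    return (u >> 1) ^ -(u & 1)

def encode_block(vals):
    """Pack one block at the bit-width of its widest zigzagged value.
    Returns (width, packed_bits); the block's fixed width is its only metadata."""
    u = [zigzag(v) for v in vals]
    width = max((x.bit_length() for x in u), default=1) or 1
    packed = 0
    for i, x in enumerate(u):
        packed |= x << (i * width)
    return width, packed

def decode_block(width, packed, n):
    """Recover n signed values from a fixed-width packed block."""
    mask = (1 << width) - 1
    return [unzigzag((packed >> (i * width)) & mask) for i in range(n)]
```

Because every value in a block shares one width, decode offsets are pure arithmetic (no per-symbol branching), which is what makes this style of coding amenable to vectorized hardware; the hierarchical-halving packing and NPU kernels described in the paper are beyond this sketch.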
Problem

Research questions and friction points this paper is trying to address.

model compression
lossless compression
Ascend NPU
weight transmission
inference acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

lossless compression
Ascend NPU
model weight optimization
fixed-length encoding
prefix-sum acceleration
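The "prefix-sum acceleration" item refers, per the abstract, to a dependency-decoupled intra-segment scan. A hypothetical scalar sketch of the idea (not the paper's kernel): compressed block sizes vary, so each block's output offset is an exclusive prefix sum; splitting the scan into independent per-segment scans plus one small scan over segment totals removes the serial dependency chain.

```python
from itertools import accumulate

def exclusive_scan(xs):
    """Exclusive prefix sum: offset where each variable-sized block starts."""
    return list(accumulate([0] + xs))[:-1]

def segmented_exclusive_scan(sizes, seg):
    """Two-phase scan: independent local scans within each segment (these can
    run in parallel), then one short scan over segment totals supplies each
    segment's base offset, added back to the local results."""
    locals_, totals = [], []
    for s in range(0, len(sizes), seg):
        chunk = sizes[s:s + seg]
        local = exclusive_scan(chunk)
        locals_.append(local)
        totals.append(local[-1] + chunk[-1])  # segment total = sum(chunk)
    bases = exclusive_scan(totals)
    return [b + x for b, local in zip(bases, locals_) for x in local]
```

The two-phase result matches a flat exclusive scan, but only the short scan over segment totals is sequential, which is the decoupling that lets wide vector units do the per-segment work independently.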
Jinwu Yang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jiaan Wu
University of Chinese Academy of Sciences, Beijing, China
Zedong Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xinyang Ma
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Hairui Zhao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Yida Gu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Yuanhong Huang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xingchen Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Wenjing Huang
RAND Corporation
Zheng Wei
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jing Xing
Lingang Laboratory
Yili Ma
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Qingyi Zhang
Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China
Baoyi An
Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China
Zhongzhe Hu
Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China
Shaoteng Liu
Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China
Xia Zhu
Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China
Jiaxun Lu
Huawei Technologies
Guangming Tan
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member