🤖 AI Summary
Post-training compression of pre-trained large language models (LLMs) on resource-constrained devices remains challenging, particularly when the original training data is unavailable and high-rank weight matrices impede conventional low-rank tensor decomposition methods.
Method: We propose Sparse-Augmented Tensor Network (Saten), the first end-to-end differentiable, post-training tensorization framework for full-model compression. Saten integrates tensor-train (TT) network parameterization, structured sparsity regularization, and fine-tuning-aware tensor decomposition.
Contribution/Results: Saten operates without access to pre-training data and overcomes the performance bottlenecks of traditional tensor networks under high-rank constraints. Experiments across multiple downstream tasks demonstrate state-of-the-art (SOTA) accuracy: an average +2.3% accuracy gain at equivalent compression ratios, a 15× reduction in parameter count, and an 82% reduction in inference GPU memory consumption.
📝 Abstract
The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their application to compressing pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pre-training data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse-augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full-model compression. Experimental results demonstrate that Saten improves both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
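To make the "sparse-augmented" idea concrete, the sketch below shows a minimal matrix-level analogue of the decomposition: a truncated SVD stands in for the tensor-train factorization, and the largest-magnitude entries of the residual form the sparse augmentation that captures the high-rank component a pure low-rank model misses. This is an illustration of the concept only, not the paper's actual TT-based implementation; the `rank` and `sparsity` parameters are hypothetical knobs.

```python
import numpy as np

def sparse_augmented_lowrank(W, rank, sparsity):
    """Approximate W as (low-rank part L) + (sparse part S).

    L: rank-`rank` truncated SVD (stand-in for a TT factorization).
    S: the `sparsity` fraction of residual entries with the largest
       magnitude, kept to recover high-rank structure.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank approximation
    R = W - L                                         # residual error
    k = int(sparsity * W.size)                        # number of entries to keep
    thresh = np.partition(np.abs(R).ravel(), -k)[-k]  # k-th largest magnitude
    S = np.where(np.abs(R) >= thresh, R, 0.0)         # sparse augmentation
    return L, S

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))                     # toy "weight matrix"
L, S = sparse_augmented_lowrank(W, rank=8, sparsity=0.05)
err_lowrank = np.linalg.norm(W - L) / np.linalg.norm(W)
err_augmented = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
```

Because `S` zeroes out only part of the residual, the augmented approximation `L + S` is strictly more accurate than `L` alone at the cost of storing a small number of extra sparse entries, which is the trade-off Saten exploits for high-rank pre-trained weights.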