NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

πŸ“… 2025-04-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address computational and memory bottlenecks when deploying large language models (LLMs) in resource-constrained settings, this paper introduces NoWag, a zero-shot, shape-preserving compression framework that unifies vector quantization (VQ) and sparse pruning. By exploiting shared principles of normalized weights and activations across the two paradigms, NoWag compresses models without fine-tuning while preserving the original model architecture and interface. It comprises two components: zero-shot vector quantization (NoWag-VQ) and zero-shot unstructured/semi-structured pruning (NoWag-P). Evaluated on Llama-2 (7B/13B/70B) and Llama-3 (8B/70B), NoWag-VQ substantially outperforms existing zero-shot VQ methods, while NoWag-P is competitive with state-of-the-art pruning approaches. The implementation is publicly available.

πŸ“ Abstract
Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for zero-shot shape-preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8B/70B models using two popular forms of shape-preserving compression: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). We found that NoWag-VQ significantly outperforms state-of-the-art zero-shot VQ methods, and that NoWag-P performs competitively against state-of-the-art pruning methods. These results suggest commonalities between these compression paradigms that could inspire future work. Our code is available at https://github.com/LawrenceRLiu/NoWag
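To make "shape-preserving, activation-guided" concrete, here is a minimal sketch of pruning guided by weight magnitude and per-channel activation norms. This is a generic Wanda-style proxy for illustration only, not NoWag's actual criterion (see the paper and repository for that); the function name and scoring rule here are hypothetical.

```python
import numpy as np

def activation_guided_prune(W, X, sparsity=0.5):
    """Zero out the lowest-scoring weights while keeping W's shape.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    Scores each weight as |w_ij| * ||X_:,j|| -- an illustrative
    activation-guided criterion, not the paper's exact one.
    """
    act_norm = np.linalg.norm(X, axis=0)        # per-input-channel activation norm
    scores = np.abs(W) * act_norm[None, :]      # importance score per weight
    k = int(W.size * sparsity)                  # number of weights to prune
    thresh = np.partition(scores.ravel(), k)[k] # k-th smallest score
    mask = scores >= thresh                     # keep only higher-scoring weights
    return W * mask                             # same shape and dtype as W
```

Because the output has the same dimensions as the input weight matrix, the compressed layer is a drop-in replacement: no architecture changes or custom kernels are required, which is the "shape-preserving" property the abstract emphasizes.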
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and memory demands of large language models
Developing shape-preserving compression for resource-constrained environments
Unifying vector quantization and pruning for efficient model compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Normalized weight and activation guided compression
Zero-shot shape preserving compression framework
Vector quantization and pruning for LLM compression
Lawrence Liu
M.S. EE 2026, UCLA
Applied Machine Learning, Optimization, Reinforcement Learning
Inesh Chakrabarti
University of California, Los Angeles
Yixiao Li
Georgia Institute of Technology
Machine Learning
Mengdi Wang
Princeton University
Tuo Zhao
Georgia Institute of Technology
Lin F. Yang
University of California, Los Angeles