🤖 AI Summary
Conventional quantization and pruning methods struggle to balance accuracy and efficiency at high compression ratios, leaving storage and transmission bottlenecks when deploying large language models (LLMs) on edge devices. This paper proposes a meta-network-based discrete latent-space compression framework: LLM weights are encoded into discrete latent vectors via a lightweight encoder, then reconstructed using a compact codebook and a small learnable decoder. Only the codebook, the decoder parameters, and low-dimensional indices need to be stored, drastically reducing the model footprint. Experiments demonstrate that at a 10× compression ratio, Llama 2-7B incurs negligible accuracy degradation and outperforms state-of-the-art compression approaches. To our knowledge, this is the first method to achieve extreme parameter-level compression of LLMs while preserving high fidelity, enabling practical edge deployment without compromising performance.
📝 Abstract
As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. This enables significant compression of the large weights in LLMs: the stored model consists solely of a small decoder, a concise codebook, and an index. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
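To make the storage argument concrete, the pipeline described above can be sketched in a toy form: weight chunks are assigned to their nearest codebook vector (the discrete latent index), a small decoder maps codebook entries back toward the weight space, and only the codebook, decoder, and indices are stored. This is a minimal illustration, not the paper's implementation; all dimensions, the nearest-neighbour encoder, and the linear decoder are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not from the paper): the weight tensor is split
# into chunks, and each chunk is quantized to one codebook vector.
chunk_dim = 8        # length of each weight chunk
codebook_size = 16   # number of representative latent vectors
n_chunks = 1024      # number of chunks in the toy weight tensor

weights = rng.standard_normal((n_chunks, chunk_dim)).astype(np.float32)

# Stand-in codebook (in the paper it would be learned jointly with the
# encoder/decoder); here it is random for illustration only.
codebook = rng.standard_normal((codebook_size, chunk_dim)).astype(np.float32)

# "Encoder" as nearest-neighbour assignment: each chunk -> discrete index.
dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1)  # shape (n_chunks,), small integers

# "Decoder" as a tiny linear map (identity plus a small learnable residual).
decoder_w = 0.01 * rng.standard_normal((chunk_dim, chunk_dim)).astype(np.float32)
reconstructed = codebook[indices] @ (np.eye(chunk_dim, dtype=np.float32) + decoder_w)

# Storage cost: only codebook + decoder floats and the per-chunk indices.
stored_floats = codebook.size + decoder_w.size
index_bits = n_chunks * int(np.ceil(np.log2(codebook_size)))
original_bits = weights.size * 32
compressed_bits = stored_floats * 32 + index_bits
print(f"compression ratio ~ {original_bits / compressed_bits:.1f}x")
```

With a random codebook the reconstruction is of course poor; the point is only the accounting: the per-chunk cost collapses from `chunk_dim` floats to `log2(codebook_size)` bits, while the codebook and decoder are fixed overheads amortized over the whole model.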