Projected Compression: Trainable Projection for Efficient Transformer Compression

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the growing inference cost of large language models (LLMs) as they scale, this paper proposes a compression method based on trainable projection modules. It learns projection matrices that map the original weights into a lower-dimensional subspace, substantially reducing the parameter count while keeping the compression training process at the base model's per-token FLOPs. Crucially, the projections remain fully differentiable and preserve access to the full original parameters during training; once the projections are merged, the result is a compact, standard Transformer model with no additional inference overhead and no custom operators required. Experiments show the method outperforms a comparable hard-pruning-and-retraining baseline on higher-quality models, with the performance margin growing as the number of training tokens increases.

📝 Abstract
Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique that reduces model weights by utilizing projection modules. Specifically, we first train additional trainable projection weights while preserving access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model's per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models. Moreover, the performance margin scales well with the number of tokens.
Problem

Research questions and friction points this paper is trying to address.

LLM growth drives up inference time and computational demands
Existing compression approaches add computational overhead during compression
Compressed models should remain standard Transformers with no custom operators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trainable projection modules reduce model weights
Merge projections into lower-dimensional product matrix
Matches base model FLOPs per token computation
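The merge step above can be sketched in a few lines. This is a minimal illustration of the projection-and-fusion idea on a single dense layer, not the paper's implementation: the dimensions, variable names, and use of one input/output projection pair per weight are assumptions for the sake of the example. The frozen original weight `W` stays accessible during training (only the projections would receive gradients), and fusing the projections yields a smaller standard weight matrix with no runtime overhead.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): original width 8, compressed width 4.
d, d_small = 8, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))            # frozen original weight, still fully accessible
P_in = rng.standard_normal((d_small, d))   # trainable input projection
P_out = rng.standard_normal((d, d_small))  # trainable output projection

# During compression training, the layer applies the projected weight;
# gradients would flow only into P_in and P_out, not W.
x = rng.standard_normal((1, d_small))
y_train = x @ (P_in @ W @ P_out)

# After training, the projections are merged into a single lower-dimensional
# product matrix: an ordinary dense layer of a smaller Transformer.
W_small = P_in @ W @ P_out                 # shape (d_small, d_small)
y_merged = x @ W_small

assert np.allclose(y_train, y_merged)      # fusion changes nothing numerically
print(W.shape, "->", W_small.shape)
```

Because the fused `W_small` is just a dense matrix, the compressed model needs no projection modules or custom operators at inference time.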