Better Prompt Compression Without Multi-Layer Perceptrons

📅 2025-01-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of existing prompt compression methods for language models, which rely on complex, multi-layer neural networks. We propose the Attention-Only Compressor (AOC), a lightweight encoder comprising solely self-attention layers, omitting all MLP components. To our knowledge, this is the first study to theoretically and empirically demonstrate that prompt compression encoders need not replicate the full decoder architecture. Removing MLP layers reduces parameter count by 67%, while maintaining superior reconstruction quality over LoRA baselines even at an extreme 480× compression ratio. By integrating LoRA adaptation and prompt regeneration optimization, AOC significantly improves both prompt reconstruction accuracy and inference speed across diverse compression ratios. Our approach establishes a novel architectural paradigm for efficient, parameter-lightweight prompt compression.
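The 67% figure follows from standard Transformer parameter accounting: per block, attention contributes four d×d projections while the MLP contributes an up- and a down-projection with the conventional 4d inner width. A quick back-of-the-envelope check (illustrative only; biases, LayerNorms, and embeddings are omitted, and the hidden sizes are assumptions, not AOC's actual configuration):

```python
# Hypothetical per-block parameter counts for a standard Transformer,
# ignoring biases and LayerNorm weights for simplicity.
def block_params(d, with_mlp=True):
    attn = 4 * d * d           # Q, K, V, and output projections
    mlp = 2 * d * (4 * d)      # up- and down-projection, 4d inner size
    return attn + (mlp if with_mlp else 0)

d = 4096  # illustrative hidden size
full = block_params(d)
attn_only = block_params(d, with_mlp=False)
saved = 1 - attn_only / full
print(f"fraction of block parameters removed: {saved:.0%}")  # ~67%
```

With a 4d MLP inner dimension the MLP holds 8d² of the block's 12d² weights, i.e. two thirds, matching the reported reduction.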

๐Ÿ“ Abstract
Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multi-layer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters compared to the original model. Intriguingly, we find that, across a range of compression ratios up to 480×, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language model without removing MLP layers. These results demonstrate that the architecture of prompt compression encoders does not need to be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
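To make the architectural change concrete, the sketch below shows what one encoder block looks like once the MLP sub-layer is removed: only a pre-norm self-attention sub-layer with a residual connection remains. This is a minimal single-head illustration under assumed shapes and names; it is not AOC's actual implementation, which additionally involves LoRA training and learned compression tokens.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the feature dimension (learned scale/shift omitted).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_only_block(x, Wq, Wk, Wv, Wo):
    """One encoder block with the MLP sub-layer removed (illustrative)."""
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])         # scaled dot-product attention
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax over keys
    return x + (probs @ v) @ Wo                     # residual add; no MLP follows

# Tiny smoke run with hypothetical dimensions.
rng = np.random.default_rng(0)
d, seq_len = 8, 5
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
y = attention_only_block(x, Wq, Wk, Wv, Wo)
print(y.shape)  # (5, 8): sequence length and width are preserved
```

In the paper's setup, a stack of such blocks would map a long prompt plus a few appended learned tokens to the compressed representation read by the unchanged decoder; only the weight matrices above (and no MLP weights) need to be stored per block.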
Problem

Research questions and friction points this paper is trying to address.

Language Model Compression
Prompt Engineering
Efficiency Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

AOC
Prompt Compression
Parameter Reduction