AI Summary
This work addresses the high computational and memory costs that large language models incur when processing long prompts, which stem from the quadratic complexity of self-attention. Existing compression methods operate primarily in token space and overlook redundancy in the embedding space. The paper introduces K-Token Merging, a framework that merges each block of K consecutive token embeddings into a single embedding in the latent embedding space via a lightweight encoder. The compressed representation is then processed by a LoRA-finetuned large language model, while generation still uses the original vocabulary. By moving beyond token-space compression, the method reduces input length substantially with minimal performance degradation. On the Textualized Tree, Amazon Reviews, and CommitPackFT benchmarks, it lies on the Pareto frontier of compression ratio versus task performance, reaching up to 75% input-length reduction.
Abstract
Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
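The merging step and its effect on sequence length can be illustrated with a minimal sketch. The abstract does not specify the lightweight encoder, so mean pooling is used here purely as a hypothetical stand-in for it; the function names `merge_k_tokens` and `compression_stats` are also illustrative and not taken from the paper. The arithmetic assumes a prompt of N tokens shrinks to roughly N/K merged embeddings, in which case K = 4 corresponds to a 75% length reduction and, for full self-attention, about a 16x drop in the quadratic cost term.

```python
from math import ceil


def merge_k_tokens(embeddings, k):
    """Merge each contiguous block of k embeddings into a single vector.

    Assumption: mean pooling stands in for the learned lightweight encoder,
    whose architecture is not described in the abstract.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not embeddings:
        return []
    dim = len(embeddings[0])
    merged = []
    for start in range(0, len(embeddings), k):
        block = embeddings[start:start + k]
        # Average each coordinate over the block; the last block may be shorter.
        merged.append(
            [sum(vec[d] for vec in block) / len(block) for d in range(dim)]
        )
    return merged


def compression_stats(n_tokens, k):
    """Return (merged length, length reduction, attention cost ratio)."""
    n_merged = ceil(n_tokens / k)
    length_reduction = 1 - n_merged / n_tokens
    # Full self-attention scales with the square of the sequence length.
    attention_ratio = (n_merged / n_tokens) ** 2
    return n_merged, length_reduction, attention_ratio


# Toy input: 8 token embeddings of dimension 2.
tokens = [[float(i), float(2 * i)] for i in range(8)]
merged = merge_k_tokens(tokens, k=4)
stats = compression_stats(len(tokens), k=4)
```

In this toy run, 8 embeddings become 2, so the merged sequence is 25% of the original length. In the framework described above, the merged embeddings would then be fed to the LoRA-adapted LLM in place of the original token embeddings, while the output side keeps the original vocabulary.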