Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

📅 2026-04-16

🤖 AI Summary
This work addresses the high computational and memory costs that large language models incur when processing long prompts, which stem from the quadratic complexity of self-attention. Existing compression methods operate primarily in token space and overlook redundancy in the latent embedding space. The paper introduces K-Token Merging, a framework that uses a lightweight encoder to merge every K consecutive token embeddings into a single embedding. The compressed sequence is then processed by a LoRA-adapted large language model, while generation still uses the original vocabulary. On the Textualized Tree, Amazon Reviews, and CommitPackFT benchmarks, the method lies on the Pareto frontier of task performance versus compression ratio, reducing input length by up to 75% with minimal performance degradation.
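
The relationship between the block size K and the reported reduction can be worked out with simple arithmetic. The symbols N (original prompt length) and N' (compressed length) are introduced here for illustration and are not taken from the paper. The summary does not state which K yields the 75% figure, so the K = 4 case below is an inference from the arithmetic, assuming only the prompt is compressed and no extra tokens are added.

```latex
N' = \left\lceil \frac{N}{K} \right\rceil,
\qquad
\text{length reduction} = 1 - \frac{N'}{N} \approx 1 - \frac{1}{K}
\quad\Longrightarrow\quad
K = 4 \;\text{gives}\; 1 - \tfrac{1}{4} = 75\%.

\text{Pairwise attention scores over the prompt: }
N^{2} \;\to\; N'^{2} \approx \frac{N^{2}}{K^{2}}
\quad (\text{about } 16\times \text{ fewer for } K = 4).
```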

📝 Abstract
Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
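
The merging step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the abstract does not specify the encoder architecture, so the sketch assumes a single linear projection over the concatenated block of K embeddings, and it assumes zero-padding when the sequence length is not a multiple of K. The names `merge_k_tokens`, `weight`, and `bias` are hypothetical.

```python
import random


def merge_k_tokens(embeddings, k, weight, bias):
    """Merge each contiguous block of k token embeddings into one embedding.

    embeddings: list of n vectors, each of length d
    weight:     (k*d) x d matrix as nested lists (ASSUMED linear encoder)
    bias:       vector of length d
    Returns ceil(n / k) merged vectors of length d.
    """
    d = len(bias)
    # ASSUMPTION: pad with zero vectors so the length is a multiple of k.
    pad = (-len(embeddings)) % k
    padded = embeddings + [[0.0] * d for _ in range(pad)]

    merged = []
    for start in range(0, len(padded), k):
        # Concatenate the k embeddings of this block into one vector of length k*d.
        block = [x for row in padded[start:start + k] for x in row]
        # Project the concatenated block back down to dimension d.
        merged.append([
            sum(block[i] * weight[i][j] for i in range(k * d)) + bias[j]
            for j in range(d)
        ])
    return merged


# Toy usage: 10 token embeddings of dimension 8, merged in blocks of K = 4.
random.seed(0)
n, d, k = 10, 8, 4
tokens = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
weight = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k * d)]
bias = [0.0] * d

merged = merge_k_tokens(tokens, k, weight, bias)
print(len(tokens), "->", len(merged))  # 10 -> 3
```

In a full pipeline the merged vectors would be passed to the LoRA-adapted LLM in place of the original token embeddings, while output tokens are still drawn from the original vocabulary, as the abstract describes. How the encoder and adapter are trained is not covered by this sketch.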
Problem

Research questions and friction points this paper is trying to address.

token compression
latent embedding space
large language models
input length reduction
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent embedding space
token compression
K-Token Merging
LoRA-adapted LLM
Pareto frontier