Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational and memory costs that large language models incur when processing long prompts, which stem from the quadratic complexity of self-attention. Existing compression methods operate primarily in token space and overlook redundancy in the latent embedding space. The paper introduces K-Token Merging, a framework that uses a lightweight encoder to merge every K consecutive token embeddings into a single embedding. The compressed sequence is then processed by a LoRA-adapted large language model, while generation still uses the original vocabulary. On the Textualized Tree, Amazon Reviews, and CommitPackFT benchmarks, the method lies on the Pareto frontier of task performance versus compression ratio, reducing input length by up to 75% with minimal performance degradation.
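As a rough worked illustration of why shortening the input helps, consider a prompt of $n$ tokens merged in blocks of $K$. The figures below assume that the 75% reduction corresponds to $K = 4$ and that attention cost is dominated by the pairwise score matrix; neither assumption is stated in the summary itself.

$$
n' = \left\lceil \frac{n}{K} \right\rceil, \qquad
1 - \frac{n'}{n} \approx 1 - \frac{1}{K}, \qquad
\frac{n^2}{n'^2} \approx K^2 .
$$

With $K = 4$ this gives a length reduction of about $1 - 1/4 = 75\%$ and roughly $4^2 = 16$ times fewer pairwise attention scores over the prompt.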

📝 Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
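The block-merging step described in the abstract can be sketched in a few lines. This is a minimal illustration only: the function name `merge_k_tokens`, the zero-padding of the last block, and the use of a single linear projection over the concatenated block are all assumptions made here, since the abstract does not specify the encoder's architecture.

```python
def merge_k_tokens(embeddings, weight, k):
    """Merge each contiguous block of k token embeddings into one vector.

    embeddings: list of n vectors, each of length d
    weight:     (k*d) x d matrix given as a list of rows; a hypothetical
                stand-in for the lightweight encoder (an assumption)
    Returns a list of ceil(n / k) vectors, each of length d.
    """
    d = len(embeddings[0])
    pad = (-len(embeddings)) % k  # zero vectors needed to fill the last block
    padded = embeddings + [[0.0] * d for _ in range(pad)]
    merged = []
    for start in range(0, len(padded), k):
        # Concatenate k consecutive embeddings into one vector of length k*d.
        block = [v for vec in padded[start:start + k] for v in vec]
        # Project the concatenated block back down to d dimensions.
        merged.append(
            [sum(block[i] * weight[i][j] for i in range(k * d)) for j in range(d)]
        )
    return merged


def mean_pool_weight(k, d):
    """Weight matrix that makes the projection equal to block-wise averaging."""
    return [[1.0 / k if i % d == j else 0.0 for j in range(d)] for i in range(k * d)]


# Ten 3-dimensional "token embeddings" merged in blocks of four.
tokens = [[float(t), float(t) + 0.5, float(t) + 1.0] for t in range(10)]
compressed = merge_k_tokens(tokens, mean_pool_weight(4, 3), k=4)
print(len(tokens), "->", len(compressed))
```

In the framework itself the projection would be learned rather than fixed, and the shortened sequence would be fed to the LoRA-adapted model in place of the original token embeddings; the averaging weights above are used only so the sketch runs without training.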

Problem

Research questions and friction points this paper is trying to address.

token compression

latent embedding space

large language models

input length reduction

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent embedding space

token compression

K-Token Merging