Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the instability of existing context compression methods in long-context scenarios, methods that often overlook the impact of data distribution on compression efficacy. From a data-centric perspective, the study systematically investigates how the input data and the intrinsic knowledge distribution of large language models jointly influence compression quality. The authors propose a semantic integrity evaluation framework based on autoencoders and introduce an input entropy metric under a frozen-decoder setting. They reveal, for the first time, a negative correlation between input entropy and compression quality, and demonstrate that distributional discrepancies between the encoder's and the decoder's intrinsic knowledge significantly diminish compression gains. Building on these insights, they formulate targeted data-side optimization strategies, offering both theoretical grounding and practical guidance for improving context compression performance.
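The summary describes the input entropy metric only at a high level. Below is a minimal sketch of one plausible formulation, assuming the metric is the mean next-token predictive entropy of the input under a frozen causal LM and assuming the Hugging Face `transformers` API; the `gpt2` checkpoint and the function name `input_entropy` are illustrative stand-ins, not the paper's actual model or code.

```python
# Hedged sketch: mean next-token predictive entropy of an input sequence
# under a frozen LM. The paper's exact metric may differ; checkpoint and
# formulation here are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the frozen encoder/decoder LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def input_entropy(text: str) -> float:
    """Average Shannon entropy (in nats) of the model's next-token
    distribution across all positions of the input."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(input_ids=ids).logits           # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return token_entropy.mean().item()

print(input_entropy("Context compression trades long inputs for a few latent slots."))
```

Under finding (1), higher values of this quantity measured with the encoder would predict lower compression quality, while the same quantity measured with the frozen decoder would show no significant relationship.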

📝 Abstract
The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research focuses only on model-side improvements; the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective and systematically investigate how data distribution impacts compression quality along two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between the intrinsic data of the encoder and that of the decoder significantly diminishes compression gains, a gap that is hard to mitigate. Based on these findings, we further present practical guidelines for optimizing compression gains.
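The abstract names the autoencoder-based evaluation but not its architecture. The sketch below shows one plausible shape for such a probe, assuming PyTorch and a Hugging Face-style frozen decoder: an encoder cross-attends the context into k soft slots, and the frozen decoder must reconstruct the original tokens from them, with reconstruction loss serving as an (inverse) semantic-integrity score. The class name `CompressionProbe`, the learnable-query compressor, and the slot count `k` are assumptions for illustration, not the paper's design.

```python
# Hedged sketch of an autoencoding probe for semantic integrity.
# Lower reconstruction loss ~ higher semantic integrity of the
# compressed representation. All architectural details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressionProbe(nn.Module):
    def __init__(self, decoder, d_model: int, k: int = 16):
        super().__init__()
        self.decoder = decoder                 # frozen, pretrained causal LM
        for p in self.decoder.parameters():
            p.requires_grad_(False)
        # Learnable queries that cross-attend to the context embeddings
        # and become the k compressed slots.
        self.queries = nn.Parameter(torch.randn(k, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def compress(self, ctx_embeds: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(ctx_embeds.size(0), -1, -1)
        slots, _ = self.attn(q, ctx_embeds, ctx_embeds)
        return slots                           # (batch, k, d_model)

    def reconstruction_loss(self, ctx_embeds, ctx_ids):
        slots = self.compress(ctx_embeds)
        # Prepend the compressed slots and ask the frozen decoder to
        # predict the original context tokens from them.
        tgt = torch.cat([slots, ctx_embeds], dim=1)
        logits = self.decoder(inputs_embeds=tgt).logits
        k = slots.size(1)
        pred = logits[:, k - 1 : -1, :]        # positions predicting ctx tokens
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               ctx_ids.reshape(-1))
```

In this framing, the paper's two findings map onto the probe directly: entropy of the context (finding 1) and a mismatch between the LM that produced `ctx_embeds` and the frozen decoder (finding 2) would both surface as higher reconstruction loss.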
Problem

Research questions and friction points this paper is trying to address.

data distribution
context compression
large language models
semantic integrity
data-centric perspective
Innovation

Methods, ideas, or system contributions that make the work stand out.

data-centric perspective
context compression
data distribution
large language models
semantic integrity
🔎 Similar Papers
No similar papers found.