🤖 AI Summary
To address core challenges in long-context compression for large language models (structural distortion, positional drift, loss of fine-grained information, and poor compatibility with closed-source APIs), this paper proposes an explicit structured compression framework based on Elementary Discourse Units (EDUs). Methodologically, it introduces a novel "structure-then-select" paradigm: EDUs are first parsed via source-position-anchored discourse parsing into a dependency-aware relation tree; query-relevant subtrees are then selected and linearized. Key technical components include the LingoEDU parser, a lightweight EDU ranking module, and source-aligned structural modeling. Contributions are threefold: (1) StructBench, the first human-annotated benchmark explicitly designed for structural understanding in compression; (2) state-of-the-art EDU structure prediction; and (3) significant gains over mainstream compression methods on long-document QA and Deep Search tasks, with both higher accuracy and lower computational overhead.
📝 Abstract
Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents, where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal, or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU parser transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs), each anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks, from long-document QA to complex Deep Search scenarios.
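The structure-then-select pipeline can be illustrated with a minimal sketch. The `EDU` class, the word-overlap `score` function, and the budget-based selection below are all illustrative stand-ins (the paper's actual LingoEDU parser and learned ranking module are not public in this abstract); the sketch only shows the key invariant: every EDU is anchored to source character indices, so the compressed output quotes the original text verbatim rather than generating new words.

```python
from dataclasses import dataclass, field

@dataclass
class EDU:
    # Span anchored to source character indices: linearization can only
    # re-emit original text, which rules out hallucinated content.
    start: int
    end: int
    children: list["EDU"] = field(default_factory=list)

    def text(self, source: str) -> str:
        return source[self.start:self.end]

def score(edu: EDU, source: str, query: str) -> float:
    # Hypothetical relevance scorer: simple word overlap, standing in
    # for the paper's lightweight learned EDU ranking module.
    q = set(query.lower().split())
    e = set(edu.text(source).lower().split())
    return len(q & e) / (len(q) or 1)

def select_and_linearize(root: EDU, source: str, query: str, budget: int) -> str:
    # Flatten the relation tree, keep the top-scoring EDUs within the
    # budget (here: a count of EDUs), then re-emit them in source order
    # so the compressed context stays locally coherent.
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    kept = sorted(nodes, key=lambda n: score(n, source, query), reverse=True)[:budget]
    kept.sort(key=lambda n: n.start)
    return " ".join(n.text(source) for n in kept)

# Toy usage: three EDUs, a two-EDU budget, and a query about cats.
source = "Cats sleep a lot. Dogs bark loudly. Cats chase mice."
tree = EDU(0, 17, children=[EDU(18, 35), EDU(36, 52)])
compressed = select_and_linearize(tree, source, "cats chase", budget=2)
```

Re-sorting the selected EDUs by `start` is the step that distinguishes this from plain extractive ranking: relevance determines *which* spans survive, but the source determines their order.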