Read As Human: Compressing Context via Parallelizable Close Reading and Skimming

📅 2026-02-02
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the computational inefficiency and information redundancy faced by large language models when processing long-context inputs. The authors propose RAM, a novel framework that introduces human-inspired close-reading and skimming mechanisms into context compression: highly relevant passages are preserved intact (close reading), while less relevant segments are compressed into query-guided summary vectors (skimming). The explicit textual tokens and implicit summary vectors are encoded in parallel and fused before being fed into the decoder. To refine the decision boundary between close reading and skimming, the framework incorporates contrastive learning. Experimental results demonstrate that RAM outperforms existing methods across multiple question-answering and summarization benchmarks, achieving up to a 12× end-to-end speedup on inputs averaging 16K tokens (up to 32K tokens in length).
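As a rough illustration of the hybrid reading strategy the summary describes, here is a minimal PyTorch-style sketch. The threshold `TAU`, the pooling-based `score_relevance`, and the `compressor` module are all illustrative assumptions, not the paper's actual components.

```python
# Hypothetical sketch of a close-reading/skimming split, assuming a
# fixed relevance threshold and a learned compressor module. Names and
# shapes are illustrative, not the authors' API.
import torch
import torch.nn.functional as F

TAU = 0.5  # assumed relevance threshold between close reading and skimming

def score_relevance(query_emb: torch.Tensor, seg_emb: torch.Tensor) -> float:
    """Cosine similarity between mean-pooled query and segment embeddings."""
    return F.cosine_similarity(
        query_emb.mean(0, keepdim=True), seg_emb.mean(0, keepdim=True)
    ).item()

def ram_compress(query_emb, segment_embs, compressor):
    """Keep high-relevance segments intact; compress the rest.

    segment_embs: list of [seg_len, d] tensors, one per context segment.
    compressor:   module mapping [seg_len, d] to a few [k, d] summary vectors.
    Each segment is scored independently, so this loop parallelizes trivially.
    """
    fused = []
    for seg in segment_embs:
        if score_relevance(query_emb, seg) >= TAU:
            fused.append(seg)              # close reading: retain all tokens
        else:
            fused.append(compressor(seg))  # skimming: compact summary vectors
    # Concatenate explicit tokens and implicit summaries for the decoder.
    return torch.cat(fused, dim=0)
```

In the paper the boundary between the two branches is refined with a contrastive objective rather than a hand-set threshold; a sketch of that objective follows the abstract below.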

📝 Abstract
Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy to address these challenges. Inspired by human reading behavior (i.e., closely reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance segments are compressed, under query guidance, into compact summary vectors (skimming). Both the explicit textual segments and the implicit summary vectors are concatenated and fed into the decoder, achieving both superior performance and natural-language interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question-answering and summarization benchmarks across two backbones, while delivering up to a 12× end-to-end speedup on long inputs (average length 16K; maximum length 32K).
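The contrastive objective over positive and negative query-segment pairs could plausibly be instantiated as an InfoNCE-style loss; the sketch below is one such assumed reading, not the paper's exact formulation.

```python
# Illustrative InfoNCE-style contrastive loss over query-segment pairs.
# Pooled embeddings, the temperature value, and the pairing scheme are
# assumptions; the paper's formulation may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_segs, neg_segs, temperature=0.07):
    """query_emb: [d]; pos_segs: [P, d]; neg_segs: [N, d] pooled embeddings."""
    q = F.normalize(query_emb, dim=-1)
    pos = F.normalize(pos_segs, dim=-1)
    neg = F.normalize(neg_segs, dim=-1)
    pos_logits = pos @ q / temperature          # [P] similarities to positives
    neg_logits = neg @ q / temperature          # [N] similarities to negatives
    # Contrast each positive segment against all negative segments.
    logits = torch.cat(
        [pos_logits.unsqueeze(1), neg_logits.expand(pos_logits.size(0), -1)],
        dim=1,
    )                                           # [P, 1 + N]
    labels = torch.zeros(pos_logits.size(0), dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, labels)
```

Training with such a loss pulls query embeddings toward relevant segments and pushes them away from irrelevant ones, which is what sharpens the close-reading/skimming decision boundary.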
Problem

Research questions and friction points this paper is trying to address.

large language models
long-context
computational inefficiency
redundant information
context compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

context compression
adaptive reading strategy
parallelizable encoding
query-guided summarization
contrastive learning
Authors

Jiwei Tang
Tsinghua University
Natural Language Processing, Large Language Model

Shilei Liu
Future Living Lab of Alibaba

Zhicheng Zhang
Carnegie Mellon University
Reinforcement Learning, Explainable RL

Qingsong Lv
Tsinghua University
Computer Science, Machine Learning

Runsong Zhao
Northeastern University, China

Tingwei Lu
Tsinghua University

Langming Liu
PhD, City University of Hong Kong
Recommendation, Large Language Models, Federated Learning

Haibin Chen
Future Living Lab of Alibaba

Yujin Yuan
Future Living Lab of Alibaba

Hai-Tao Zheng
Tsinghua University, Pengcheng Laboratory

Wenbo Su
Future Living Lab of Alibaba

Bo Zheng
Researcher, Alibaba Group
AI, Network, E-Commerce