🤖 AI Summary
This work addresses the computational inefficiency and information redundancy that large language models face when processing long-context inputs. The authors propose RAM, a novel framework that introduces human-inspired close-reading and skimming mechanisms into context compression: highly relevant passages are preserved intact (close reading), while less relevant segments are compressed into query-guided summary vectors (skimming). Segments are encoded with the query in parallel, and the resulting explicit textual tokens and implicit summary vectors are concatenated before being fed into the decoder. To refine the decision boundary between close reading and skimming, the framework incorporates a contrastive learning objective. Experimental results demonstrate that RAM outperforms existing methods across multiple question-answering and summarization benchmarks, achieving up to a 12× end-to-end speedup on inputs averaging 16K tokens and reaching up to 32K tokens in length.
📝 Abstract
Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. To address these challenges, we propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance segments are compressed under query guidance into compact summary vectors (skimming). Both the explicit textual segments and the implicit summary vectors are concatenated and fed into the decoder, achieving strong performance while preserving natural-language interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12× end-to-end speedup on long inputs (average length 16K tokens; maximum length 32K).
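The close-reading/skimming split described above can be sketched in a few lines. The toy code below is an illustrative approximation only: it routes segments by cosine similarity between a query embedding and segment embeddings (a stand-in for RAM's learned relevance decision), and "skims" a segment by mean-pooling its token vectors into a handful of summary vectors (a stand-in for the paper's learned, query-guided compression). All function names, the threshold, and the pooling scheme are assumptions, not the authors' implementation.

```python
import numpy as np

def route_segments(query_vec, segment_vecs, threshold=0.5):
    """Toy routing: cosine similarity to the query decides which segment
    indices are close-read (kept verbatim) vs. skimmed (compressed).
    The threshold is illustrative; RAM learns this boundary."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    close_read, skim = [], []
    for i, seg in enumerate(segment_vecs):
        (close_read if cos(query_vec, seg) >= threshold else skim).append(i)
    return close_read, skim

def skim_to_summary(seg_token_vecs, n_summary=4):
    """Compress one segment's token vectors into n_summary vectors by
    mean-pooling contiguous chunks -- a crude proxy for query-guided
    compression into compact summary vectors."""
    chunks = np.array_split(np.asarray(seg_token_vecs), n_summary)
    return np.stack([c.mean(axis=0) for c in chunks])
```

In RAM itself, the close-read token sequences and the skimmed summary vectors would then be concatenated (in context order) and fed to the decoder; the sketch only covers the routing and compression steps.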