🤖 AI Summary
Compressor-predictor design in LM-based agent systems lacks theoretical foundations, and compressor quality remains difficult to evaluate in a task-agnostic manner. Method: This paper is the first to model agent architectures as information transmission processes, introducing a mutual-information-based framework for quantifying compressor quality—enabling downstream-task-free performance prediction and system optimization. Contribution/Results: The framework reveals that scaling compressors yields higher cost-performance benefits than scaling predictors. Experiments across the Qwen, Llama, and Phi model families demonstrate that a 7B compressor improves accuracy by 1.6×, text conciseness by 4.6×, and information density by 5.5× over a 1.5B baseline; a 3B local compressor recovers 99% of frontier-model accuracy while reducing API costs by 74%. This work establishes the first task-agnostic, information-theoretic design principle for LM agent systems.
📝 Abstract
Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," leveraging multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (which can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors are not only more accurate but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.5\times$ more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover $99\%$ of frontier-LM accuracy at $26\%$ of API costs.
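The abstract's core quantity—mutual information between a context $X$ and its compression $Z$—can be estimated from log-likelihood differences: $I(X;Z) \approx \mathbb{E}[\log p(z \mid x) - \log p(z)]$. The sketch below illustrates this estimator on a toy noiseless channel; the `cond_logprob`/`marg_logprob` interface is a hypothetical stand-in for scoring a compression with an LM given (and without) the context, and is not the paper's actual implementation.

```python
import math

def estimate_mutual_information(pairs, cond_logprob, marg_logprob):
    """Estimate I(X; Z) in bits as the average of
    log2 p(z|x) - log2 p(z) over (context, compression) pairs.

    In the compressor-LM setting, cond_logprob would score the
    compression z given the context x, and marg_logprob would score
    z alone (both hypothetical interfaces, assumed here)."""
    total = 0.0
    for x, z in pairs:
        total += cond_logprob(z, x) - marg_logprob(z)
    return total / len(pairs)

# Toy demo: Z is a noiseless copy of X over a uniform 4-symbol
# alphabet, so I(X; Z) = H(X) = 2 bits exactly.
symbols = ["a", "b", "c", "d"]
pairs = [(s, s) for s in symbols]
cond = lambda z, x: 0.0 if z == x else float("-inf")  # p(z|x) = 1 iff z = x
marg = lambda z: math.log2(0.25)                      # uniform marginal p(z)

print(estimate_mutual_information(pairs, cond, marg))  # → 2.0
```

Dividing the estimate by the compression's token count would give the "bits of mutual information per token" figure the abstract uses to compare compressor sizes.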