An Information Theoretic Perspective on Agentic System Design

📅 2025-12-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: The design of compressor-predictor systems for LM-based agents lacks theoretical foundations, and compressor quality remains difficult to evaluate in a task-agnostic manner. Method: This paper models agent architectures as information transmission processes and introduces a mutual-information-based framework for quantifying compressor quality, enabling performance prediction and system optimization without running downstream tasks. Contribution/Results: The framework reveals that scaling compressors yields higher cost-performance benefits than scaling predictors. Experiments across the Qwen, Llama, and Phi model families show that a 7B compressor is 1.6× more accurate, 4.6× more concise, and conveys 5.5× more bits of mutual information per token than a 1.5B baseline; a 3B local compressor recovers 99% of frontier-LM accuracy while reducing API costs by 74%. The work establishes a task-agnostic, information-theoretic design principle for LM agent systems.

📝 Abstract
Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (that can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors are not only more accurate, but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.5\times$ more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover $99\%$ of frontier-LM accuracy at $26\%$ of API costs.

Problem

Research questions and friction points this paper is trying to address.

Designing compressor-predictor LM systems lacks theoretical guidance for performance optimization.
Attributing performance gains to compression vs prediction requires costly task-specific evaluations.
Determining optimal compressor and predictor choices for downstream tasks is ad hoc.

Innovation

Methods, ideas, or system contributions that make the work stand out.

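Uses an estimate of the mutual information between the context and its compression to measure compression quality, independent of any specific task.
Shows that larger compressors are both more accurate and more token-efficient, conveying more bits of information per token.
Shows that scaling compressors is substantially more effective than scaling predictors.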
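
To make the mutual-information measure of compression quality more concrete, here is a minimal Python sketch. It is not necessarily the estimator used in the paper, whose exact form is not given in this summary. It assumes a standard Monte Carlo estimate, $\hat{I}(X;Z) = \frac{1}{N}\sum_i \left[\log p(z_i \mid x_i) - \log \frac{1}{N}\sum_j p(z_i \mid x_j)\right]$, where $x_j$ are contexts and $z_i$ is the compression generated from $x_i$. The function names and the `log_probs` matrix layout are hypothetical; in practice the matrix would be filled by scoring each compression under each context with the compressor LM.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def estimate_mutual_information_bits(log_probs: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate of I(X; Z) in bits (assumed estimator, see lead-in).

    log_probs[i][j] is ln p(z_i | x_j): the log-likelihood, in nats, of the
    compression generated from context i, scored under context j.
    """
    n = len(log_probs)
    total_nats = 0.0
    for i in range(n):
        conditional = log_probs[i][i]                      # ln p(z_i | x_i)
        marginal = logsumexp(log_probs[i]) - math.log(n)   # ln (1/N) sum_j p(z_i | x_j)
        total_nats += conditional - marginal
    return total_nats / n / math.log(2)                    # convert nats to bits


def bits_per_token(mi_bits: float, mean_compression_tokens: float) -> float:
    """Information density: bits of mutual information per compression token."""
    return mi_bits / mean_compression_tokens


if __name__ == "__main__":
    # A compressor whose output is equally likely under every context
    # carries no information about which context produced it.
    uninformative = [[-3.0] * 4 for _ in range(4)]
    print(estimate_mutual_information_bits(uninformative))

    # A compressor whose output is only possible under its own context
    # identifies the context exactly.
    ninf = float("-inf")
    perfect = [[0.0 if i == j else ninf for j in range(4)] for i in range(4)]
    print(estimate_mutual_information_bits(perfect))
```

With $N$ sampled contexts, this particular estimate cannot exceed $\log_2 N$ bits, so the sample size bounds how much information it can detect. Dividing the estimate by the mean compression length gives a bits-per-token figure of the kind the summary reports.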