AgentOCR: Reimagining Agent History via Optical Self-Compression

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work addresses the high token cost and memory overhead in multi-turn interactive agents caused by rapidly growing textual history. To mitigate this, the authors propose compressing the observation-action history into compact images, leveraging the high information density of visual tokens to reduce the language model’s input burden. A segmented hashing cache is introduced to avoid redundant rendering, and a reinforcement learning–driven adaptive compression mechanism enables the agent to autonomously adjust its compression rate, jointly optimizing task performance and computational efficiency. Evaluated on ALFWorld and search-based question answering tasks, the method maintains over 95% of the original performance while reducing token consumption by more than 50% and achieving a 20× speedup in rendering.
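The segmented hashing cache described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the SHA-256 choice, and the `render_fn` callback are all assumptions made for the example. The core idea is simply that each history segment is hashed, and a segment whose hash has been seen before reuses its cached rendered image instead of being re-rendered.

```python
import hashlib

class SegmentOpticalCache:
    """Hypothetical sketch of segment optical caching: the history is
    split into segments, each segment is hashed, and rendered images
    are cached by hash so unchanged segments are never re-rendered."""

    def __init__(self, render_fn):
        self.render_fn = render_fn  # maps a text segment to a rendered image
        self.cache = {}             # segment hash -> rendered image
        self.renders = 0            # counts actual render calls

    def _key(self, segment: str) -> str:
        return hashlib.sha256(segment.encode("utf-8")).hexdigest()

    def render_history(self, segments):
        """Return rendered images for all segments, rendering only
        those not already present in the cache."""
        images = []
        for seg in segments:
            key = self._key(seg)
            if key not in self.cache:
                self.cache[key] = self.render_fn(seg)
                self.renders += 1
            images.append(self.cache[key])
        return images
```

In a multi-turn rollout, each new turn appends one segment to the history, so only that new segment triggers a render; every earlier segment is a cache hit, which is where the reported rendering speedup would come from.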

📝 Abstract
Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20× rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
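The compression-aware reward described in the abstract can be sketched in a simple form. The function below is an assumption for illustration only: the paper does not specify its reward formula here, and the name `compression_aware_reward`, the linear penalty, and the coefficient `lam` are all hypothetical. The point is the shape of the trade-off: reward rises with task success and falls with the fraction of the token budget consumed, so an agent choosing its own compression rate is pushed to compress harder whenever doing so does not hurt the task.

```python
def compression_aware_reward(task_success: float,
                             tokens_used: int,
                             token_budget: int,
                             lam: float = 0.1) -> float:
    """Hypothetical compression-aware reward: task success minus a
    penalty proportional to the fraction of the token budget used.
    lam controls the trade-off between success and token efficiency."""
    return task_success - lam * (tokens_used / token_budget)
```

Under this sketch, a successful episode using half the budget scores 1.0 - 0.1 * 0.5 = 0.95, while the same success at the full budget scores 0.90, so stronger compression is rewarded when success is preserved.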
Problem

Research questions and friction points this paper is trying to address.

agent history
token efficiency
memory usage
textual history
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentOCR
optical self-compression
segment optical caching
visual token representation
compression-aware reward