AgentOCR: Reimagining Agent History via Optical Self-Compression

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

This work addresses the high token cost and memory overhead in multi-turn interactive agents caused by rapidly growing textual history. To mitigate this, the authors propose compressing the observation-action history into compact images, leveraging the high information density of visual tokens to reduce the language model’s input burden. A segmented hashing cache is introduced to avoid redundant rendering, and a reinforcement learning–driven adaptive compression mechanism enables the agent to autonomously adjust its compression rate, jointly optimizing task performance and computational efficiency. Evaluated on ALFWorld and search-based question answering tasks, the method maintains over 95% of the original performance while reducing token consumption by more than 50% and achieving a 20× speedup in rendering.
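The segmented hashing cache described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `SegmentRenderCache`, the SHA-256 key, and the stand-in renderer `fake_render` are all assumptions made for the example, and a real system would rasterize text into image tiles instead of returning bytes.

```python
import hashlib
from typing import Callable, Dict, List


class SegmentRenderCache:
    """Hypothetical cache: render each distinct history segment only once."""

    def __init__(self, render_fn: Callable[[str], bytes]) -> None:
        self._render_fn = render_fn          # assumed renderer: text -> image bytes
        self._cache: Dict[str, bytes] = {}   # content hash -> rendered tile
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(segment: str) -> str:
        # Content hash, so identical segments map to the same cache entry.
        return hashlib.sha256(segment.encode("utf-8")).hexdigest()

    def render_history(self, segments: List[str]) -> List[bytes]:
        tiles = []
        for segment in segments:
            key = self._key(segment)
            if key in self._cache:
                self.hits += 1
            else:
                self.misses += 1
                self._cache[key] = self._render_fn(segment)
            tiles.append(self._cache[key])
        return tiles


def fake_render(segment: str) -> bytes:
    # Stand-in for a real text-to-image renderer (assumption for this sketch).
    return b"IMG:" + segment.encode("utf-8")


cache = SegmentRenderCache(fake_render)
cache.render_history(["obs 1", "act 1"])           # turn 1: both segments rendered
cache.render_history(["obs 1", "act 1", "obs 2"])  # turn 2: only "obs 2" is new
```

Because the history grows by appending segments, each new turn only pays the rendering cost of the segments it adds; earlier segments are served from the cache.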

📝 Abstract

Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
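To make the idea of a compression-aware reward concrete, here is one possible shape such a reward could take. The abstract does not give the formula, so everything below is an assumption for illustration: the function name, the linear bonus, the `bonus_weight` and `max_rate` parameters, and the choice to grant the bonus only on successful episodes.

```python
def compression_aware_reward(
    task_success: bool,
    compression_rate: float,
    bonus_weight: float = 0.1,   # hypothetical weight on the efficiency bonus
    max_rate: float = 4.0,       # hypothetical cap on the rewarded rate
) -> float:
    """Illustrative reward: task reward plus a bounded bonus for compression.

    A compression_rate of 1.0 means no compression. The bonus grows linearly
    up to max_rate and is only granted when the task succeeds, so the agent
    cannot gain reward by compressing aggressively and then failing.
    """
    if compression_rate < 1.0:
        raise ValueError("compression_rate must be >= 1.0")
    if not task_success:
        return 0.0
    clipped = min(compression_rate, max_rate)
    bonus = bonus_weight * (clipped - 1.0) / (max_rate - 1.0)
    return 1.0 + bonus
```

Under these assumptions, gating the bonus on success keeps task performance as the dominant term, while the small bounded bonus nudges the agent toward higher compression when it can afford it.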

Problem

Research questions and friction points this paper is trying to address.

agent history

token efficiency

memory usage

textual history

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentOCR

optical self-compression

segment optical caching