🤖 AI Summary
This work addresses the challenge of long-horizon agent reasoning under limited context windows, where conventional text-based memory methods allocate tokens uniformly and thus struggle to preserve critical information. The authors propose MemOCR, a novel approach that, for the first time, introduces visual layout into memory compression: by rendering structured rich-text elements (such as headings and highlights) into images, it exploits spatial arrangement to adaptively allocate information density, prioritizing key evidence under strict context budgets. By integrating budget-aware objectives into reinforcement learning training, the model stays robust across varying memory constraints. Experiments show that MemOCR significantly outperforms strong textual baselines on both multi-hop and single-hop question answering benchmarks, with especially pronounced gains under extreme context compression.
📝 Abstract
Long-horizon agentic reasoning requires compressing a growing interaction history into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To address this, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
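The idea of adaptive information density can be sketched in a simple, hedged form. The following pure-Python illustration is not the MemOCR implementation; the importance weights, the character budget, and the per-entry minimum stub are all assumptions for illustration. It splits a fixed budget roughly in proportion to each memory entry's importance, keeping key evidence intact while aggressively truncating auxiliary details, analogous to rendering headings and highlights prominently and shrinking the rest of the layout.

```python
# Minimal sketch of budget-aware adaptive information density
# (illustrative only; NOT the authors' method).

def compress_memory(entries, budget, min_stub=8):
    """entries: list of (importance_weight, text); budget: total char budget."""
    total_weight = sum(w for w, _ in entries) or 1.0
    compressed = []
    for weight, text in entries:
        # Proportional share of the budget, with a small minimum stub
        # so every entry leaves at least a trace (assumed heuristic).
        quota = max(min_stub, int(budget * weight / total_weight))
        if len(text) <= quota:
            compressed.append(text)          # key evidence kept verbatim
        else:
            compressed.append(text[: quota - 1] + "…")  # aggressive truncation
    return "\n".join(compressed)
```

In the actual system this allocation happens visually, through font size, emphasis, and spatial placement in a rendered image, rather than by character truncation; the sketch only conveys the non-uniform budgeting principle.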