OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and information loss that large language model agents face in long-horizon tasks because of limited context windows. The authors propose OCR-Memory (Optical Context Retrieval Memory), a framework that uses the visual modality as a high-density memory medium by rendering historical trajectories into images annotated with unique visual identifiers. Retrieval follows a “locate-and-transcribe” mechanism: relevant regions are first selected through visual anchors, and the corresponding text is then returned verbatim, which avoids free-form generation and reduces hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, indicating that optical encoding expands effective memory capacity while preserving faithful recall of past interactions.

📝 Abstract

Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a “locate-and-transcribe” paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.
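
The locate-and-transcribe idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name, the `[S0000]`-style identifiers, and the word-overlap scoring are all hypothetical, and the image rendering and visual localization steps are replaced by plain-text stand-ins. The only property it aims to show is that retrieval returns identifiers first and then looks up the stored text verbatim, so nothing is generated freely.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OpticalMemorySketch:
    """Toy identifier-indexed memory. All names here are hypothetical."""

    # Maps a unique identifier (e.g. "[S0003]") to the verbatim step text.
    segments: dict[str, str] = field(default_factory=dict)

    def add_step(self, text: str) -> str:
        """Store one trajectory step and return its identifier."""
        anchor = f"[S{len(self.segments):04d}]"
        self.segments[anchor] = text
        return anchor

    def render_page(self) -> str:
        """Stand-in for rendering: the annotated text that would be drawn.

        The paper renders trajectories into images; this sketch stops at
        the identifier-annotated text and draws nothing.
        """
        return "\n".join(f"{a} {t}" for a, t in self.segments.items())

    def locate(self, query: str, top_k: int = 2) -> list[str]:
        """Stand-in for visual localization: rank steps by word overlap.

        Assumption: a real system would select regions with a vision
        model; word overlap is used only to keep the sketch runnable.
        """
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(text.lower().split())), anchor)
            for anchor, text in self.segments.items()
        ]
        # Highest overlap first; ties broken by identifier order.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [anchor for score, anchor in ranked[:top_k] if score > 0]

    def transcribe(self, anchors: list[str]) -> list[str]:
        """Return stored text verbatim; unknown identifiers are dropped."""
        return [self.segments[a] for a in anchors if a in self.segments]


mem = OpticalMemorySketch()
mem.add_step("open drawer 1 and find a key")
mem.add_step("go to the kitchen")
mem.add_step("use key on the locked chest")

hits = mem.locate("where is the key", top_k=2)
recalled = mem.transcribe(hits)
```

Because `transcribe` is a lookup rather than a generation step, an identifier that does not exist yields nothing instead of invented text, which is the faithfulness property the abstract emphasizes.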

Problem

Research questions and friction points this paper is trying to address.

long-horizon agent memory

text-context budget

information loss

evidence fragmentation

token efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical Context Retrieval

Agent Memory

Visual Modality