🤖 AI Summary
This work addresses the challenge of effectively managing multimodal memory in long-running AI agents, a task hindered by the vast and complex design space that resists efficient exploration via manual tuning or conventional AutoML approaches. The study introduces autonomous research into the design of multimodal lifelong memory systems, establishing an automated experimental pipeline capable of executing approximately 50 iterations without human intervention. Through iterative diagnosis of failures, architectural refinement, data pipeline repair, and co-design of retrieval strategies, the pipeline discovers OmniMem—a unified framework demonstrating that architectural modifications, prompt engineering, and bug fixes exert substantially greater impact on performance than hyperparameter tuning alone. OmniMem achieves state-of-the-art results, improving F1 scores by 411% (from 0.117 to 0.598) on LoCoMo and by 214% (from 0.254 to 0.797) on Mem-Gallery.
📝 Abstract
AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected to explore effectively by hand or with traditional AutoML. We deploy an autonomous research pipeline to discover OmniMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ~50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art results on both benchmarks, improving F1 by +411% on LoCoMo (0.117→0.598) and +214% on Mem-Gallery (0.254→0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at https://github.com/aiming-lab/OmniMem.
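The headline percentages are relative F1 gains over the initial configurations, i.e. (new − old) / old. A minimal sketch checking this arithmetic against the reported scores (the helper name `rel_improvement` is ours, not from the paper):

```python
def rel_improvement(old: float, new: float) -> float:
    """Percentage improvement of `new` over `old`, relative to `old`."""
    return (new - old) / old * 100

# F1 scores reported in the abstract
print(round(rel_improvement(0.117, 0.598)))  # LoCoMo: 411
print(round(rel_improvement(0.254, 0.797)))  # Mem-Gallery: 214
```

Both rounded values match the +411% and +214% figures quoted above.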