🤖 AI Summary
This work addresses a gap in existing multimodal reasoning benchmarks, which predominantly focus on pre-integrated contexts and overlook the ability to synthesize evidence dispersed across independent, heterogeneous sources such as conversations, images, and documents. To this end, we introduce SMMBench, a benchmark designed to systematically evaluate source-distributed multimodal memory capabilities. It covers four core dimensions: cross-source reasoning, conflict resolution, preference inference, and memory-driven behavior prediction. Constructed from real-world, multi-source heterogeneous data, SMMBench comprises 1,877 samples derived from 264 distinct sources. Empirical evaluation shows that current memory-style and retrieval-based systems still struggle on this benchmark, underscoring the difficulty and research importance of multimodal memory integration for intelligent agents operating in complex, real-world environments.
📝 Abstract
Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts and under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; and (4) memory-grounded action prediction. The benchmark contains 1,877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle with these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.