🤖 AI Summary
This work addresses the lack of systematic evaluation of scientific agents on multi-turn interaction, multimodal evidence integration, and long-term memory utilization. To bridge this gap, we propose EpiBench, the first multi-turn, multimodal benchmark grounded in real-world scientific workflows: agents must perform cross-paper retrieval, align textual and visual evidence, and answer complex questions that demand cross-document comparison and fusion of multiple figures within short research episodes. The benchmark introduces a fine-grained, process-level evaluation framework that emphasizes active retrieval, multi-source fusion, and memory-augmented reasoning. Experimental results show that even state-of-the-art models achieve only 29.23% accuracy on the challenging hard split, highlighting substantial room for improvement and providing the community with a reproducible evaluation platform.
📝 Abstract
Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed by existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic, multi-turn, multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparison and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows. EpiBench thus provides an evaluation platform for verifiable and reproducible research agents.