🤖 AI Summary
This work addresses the limited fine-grained perception and complex reasoning capabilities of multimodal agents on heterogeneous image-text data by proposing a stateful, experience-based learning paradigm. The approach abstracts interaction trajectories into atomic decision experiences through hindsight reasoning, organizes them into a quality-filtered experience bank, and supports policy-driven, precise retrieval at inference time. By combining stateful experience modeling with complementary wide- and deep-search strategies, it enables multi-perspective, adaptive use of multimodal experiences and overcomes the representational limitations of conventional trajectory-level retrieval. Experiments show that the method consistently outperforms strong baselines on tasks requiring fine-grained visual perception and complex multimodal reasoning, effectively improving agent decision-making.
📝 Abstract
Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.