MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited fine-grained perception and complex-reasoning capabilities of multimodal agents processing heterogeneous image-text data by proposing a stateful, experience-based learning paradigm. The approach abstracts interaction trajectories into atomic decision experiences through hindsight reasoning, builds a quality-filtered experience repository, and enables precise, policy-driven retrieval at inference time. By integrating stateful experience modeling with complementary wide- and deep-search strategies, it supports multi-perspective, adaptive use of multimodal experiences, overcoming the representational limits of conventional trajectory-level retrieval. Experiments show that the proposed method significantly outperforms strong baselines on tasks requiring fine-grained visual perception and complex multimodal reasoning, effectively improving agent decision-making.
📝 Abstract
Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
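The experience-bank mechanism the abstract describes (quality-filtered admission of atomic experiences, then wide- or deep-search retrieval at inference time) can be illustrated with a minimal sketch. Everything here is a hypothetical reading of the abstract: the `Experience` fields, the `ExperienceBank` API, and the tag-overlap scoring are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """One atomic decision experience distilled from a trajectory (hypothetical schema)."""
    state_tags: frozenset  # semantic tags describing the decision state
    action: str            # the action the agent took in that state
    lesson: str            # hindsight-derived guidance for reuse
    quality: float         # quality score used for bank admission

@dataclass
class ExperienceBank:
    min_quality: float = 0.5
    experiences: list = field(default_factory=list)

    def add(self, exp: Experience) -> bool:
        # Quality filter: only admit experiences above the threshold.
        if exp.quality >= self.min_quality:
            self.experiences.append(exp)
            return True
        return False

    def wide_search(self, query_tags: frozenset, k: int = 3) -> list:
        # Wide search: rank the whole bank by tag overlap with the query
        # and return the top-k, gathering guidance from diverse viewpoints.
        scored = sorted(self.experiences,
                        key=lambda e: len(e.state_tags & query_tags),
                        reverse=True)
        return scored[:k]

    def deep_search(self, query_tags: frozenset) -> list:
        # Deep search: take the single best match, then re-query with its
        # tags to drill into related experiences along one semantic thread.
        best = max(self.experiences,
                   key=lambda e: len(e.state_tags & query_tags),
                   default=None)
        if best is None:
            return []
        related = [e for e in self.experiences
                   if e is not best and e.state_tags & best.state_tags]
        return [best] + related[:2]
```

Under this reading, wide search trades precision for coverage (many partially matching experiences), while deep search commits to the closest match and expands around it; the paper's "policy-driven" retrieval would choose between the two per query.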
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
stateful experiences
experience retrieval
visual perception
research agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

stateful experience
multimodal reasoning
hindsight reasoning
experience retrieval
policy-driven search