POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

📅 2026-04-15

🤖 AI Summary
This work addresses a limitation of current Large Multimodal Models (LMMs): their static parametric knowledge constrains long-horizon, knowledge-intensive visual reasoning unless they can retrieve external evidence. The authors propose POINTS-Seeker-8B, a multimodal agentic search model built from scratch rather than by retrofitting a general LMM with search tools. Training includes an Agentic Seeding phase that lays the groundwork for agentic behaviors. To mitigate information overload during extended interactions, the V-Fold adaptive history compression scheme keeps recent dialogue turns in high fidelity and renders older context into the visual space. The authors report that the model consistently outperforms existing models across six diverse benchmarks.

📝 Abstract
While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
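
The V-Fold idea described in the abstract (recent turns kept verbatim, older turns folded into a rendered form) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Turn`, `RenderedPage`, and `fold_history`, the fixed `keep_recent` cutoff, and the page layout parameters are all hypothetical. The "rendering" here only lays text out into page-sized line blocks as a stand-in for rasterizing it into an image, and the adaptive policy mentioned in the abstract is replaced by a constant cutoff.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # e.g. "user", "assistant", "tool" (hypothetical labels)
    text: str


@dataclass
class RenderedPage:
    """Stand-in for an image: the wrapped lines a rasterizer would draw."""
    lines: list


def fold_history(turns, keep_recent=2, width=60, lines_per_page=20):
    """Split a dialogue history into (folded pages, recent turns).

    Assumption: a fixed `keep_recent` cutoff replaces the adaptive
    policy that the abstract mentions but does not specify.
    """
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]

    # Lay out the older turns as wrapped text lines.
    lines = []
    for turn in older:
        lines.extend(textwrap.wrap(f"[{turn.role}] {turn.text}", width=width))

    # Group the lines into fixed-size pages, one page per would-be image.
    pages = [
        RenderedPage(lines[i:i + lines_per_page])
        for i in range(0, len(lines), lines_per_page)
    ]
    return pages, recent


history = [
    Turn("user", "q1"),
    Turn("tool", "r1"),
    Turn("assistant", "a1"),
    Turn("user", "q2"),
]
pages, recent = fold_history(history, keep_recent=2)
```

In this sketch the two oldest turns end up on a single page while the last two stay as plain text. A real system would draw each page into an image (for example with an imaging library) and pass it to the model's vision encoder alongside the recent text turns.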
Problem

Research questions and friction points this paper is trying to address.

multimodal search
long-horizon interaction
knowledge-intensive reasoning
visual evidence retrieval
agentic behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agentic search
Agentic Seeding
V-Fold
history-aware compression
visual reasoning