From Web to Pixels: Bringing Agentic Search into Visual Perception

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual perception methods are limited to information inside the image or to static model knowledge, so they struggle with open-world object recognition and localization tasks that depend on external facts, recent events, or complex relationships. This work formalizes that setting as Perception Deep Research and integrates agent-driven deep web search into visual perception. It introduces the WebEye benchmark and the Pixel-Searcher agent workflow, which supports knowledge-intensive pixel-level localization, segmentation, and visual question answering through four steps: object anchoring, multi-hop knowledge retrieval, identity resolution, and binding of the retrieved evidence to image pixels. The approach achieves the strongest performance among open-source methods on all three tasks, offering a verifiable and scalable approach to open-world visual understanding.
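The four stages named in the summary can be pictured with a minimal Python sketch. This is not the authors' implementation: the names `Candidate`, `Hop`, `Result`, and `run_workflow`, the `{prev}` placeholder convention, and the substring-based binding rule are all assumptions made for illustration, and the `search` callable stands in for whatever web search tool a real agent would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Candidate:
    """A visible object proposed by the anchoring stage (hypothetical type)."""
    label: str  # coarse visual description, e.g. "car 7"
    box: Box

@dataclass
class Hop:
    """One retrieval step, kept so the final answer stays verifiable."""
    query: str
    answer: str

@dataclass
class Result:
    identity: Optional[str]
    box: Optional[Box]
    trace: List[Hop] = field(default_factory=list)

def run_workflow(
    hop_templates: List[str],
    candidates: List[Candidate],
    search: Callable[[str], Optional[str]],
) -> Result:
    """Chain search hops, resolve an identity, then bind it to a candidate box."""
    trace: List[Hop] = []
    answer: Optional[str] = None
    # Multi-hop retrieval: each query may reuse the previous answer via {prev}.
    for template in hop_templates:
        query = template.format(prev=answer)
        answer = search(query)
        if answer is None:
            # Evidence acquisition failed; stop without guessing.
            return Result(None, None, trace)
        trace.append(Hop(query, answer))
    # Identity resolution: in this toy version the last answer is the identity.
    identity = answer
    if identity is None:
        return Result(None, None, trace)
    # Visual instance binding: naive substring match against candidate labels.
    for cand in candidates:
        if identity.lower() in cand.label.lower():
            return Result(identity, cand.box, trace)
    return Result(identity, None, trace)
```

The three early returns loosely mirror the failure sources the abstract reports: evidence acquisition, identity resolution, and visual instance binding. A real system would replace the substring match with a visual grounding model.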

📝 Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or in frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
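To make the "object-anchored" structure concrete, here is a hedged Python sketch of how one annotated object might be expanded into the three task views. The field names, the `mask_rle` encoding, and the `expand_views` helper are hypothetical; the abstract does not describe the benchmark's file format or how QA pairs map to task samples.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class ObjectInstance:
    """One annotated object with its supporting evidence (hypothetical schema)."""
    image_id: str
    box: Box
    mask_rle: Optional[str]   # encoded mask, if one was annotated
    evidence_urls: List[str]  # sources that make the identity verifiable

@dataclass
class TaskSample:
    task: str                 # "grounding", "segmentation", or "vqa"
    image_id: str
    question: str
    target: Union[Box, str]   # box, encoded mask, or textual answer

def expand_views(
    inst: ObjectInstance, question: str, answer: Optional[str]
) -> List[TaskSample]:
    """Build one sample per task view that the available annotations support."""
    samples = [TaskSample("grounding", inst.image_id, question, inst.box)]
    if inst.mask_rle is not None:
        samples.append(
            TaskSample("segmentation", inst.image_id, question, inst.mask_rle)
        )
    if answer is not None:
        samples.append(TaskSample("vqa", inst.image_id, question, answer))
    return samples
```

Anchoring every sample to the same object instance means the box, mask, and answer can all be checked against one piece of evidence.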

Problem

Research questions and friction points this paper is trying to address.

visual perception

open-world

object grounding

knowledge-intensive

external evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic search

visual perception

knowledge-intensive grounding