🤖 AI Summary
Existing visual perception methods are limited to information within the image or static model knowledge, so they struggle with open-world object recognition and localization tasks that rely on external facts, recent events, or complex relationships. This work proposes the Perception Deep Research framework, which pioneers the integration of agent-driven deep web search into visual perception. It introduces the WebEye benchmark dataset and designs the Pixel-Searcher agent workflow, enabling knowledge-intensive pixel-level localization, segmentation, and visual question answering through object anchoring, multi-hop knowledge retrieval, identity resolution, and end-to-end binding from retrieved evidence to image pixels. The approach achieves state-of-the-art open-source performance across three tasks, establishing a verifiable and scalable new paradigm for open-world visual understanding.
📝 Abstract
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or in frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
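To make the search-to-pixel idea concrete, the toy sketch below follows a chain of relations to a hidden identity and then binds that identity to a box. It is not the Pixel-Searcher implementation: every name in it (`Candidate`, `TOY_FACTS`, `resolve_identity`, `bind_to_pixels`, `search_to_pixel`) is hypothetical, the entities are made up, and an in-memory dictionary stands in for web search.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Candidate:
    """A visible object instance (hypothetical detector output)."""
    name: str  # identity a visual model could attach to this instance
    box: Box


# Toy stand-in for web search: (entity, relation) -> entity.
# All entities here are invented for illustration.
TOY_FACTS: Dict[Tuple[str, str], str] = {
    ("Acme Cup 2024", "winner"): "Team Red",
    ("Team Red", "captain"): "Player A",
}


def resolve_identity(start: str, relations: List[str],
                     facts: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Follow a chain of relations (multi-hop) to the hidden target identity."""
    entity: Optional[str] = start
    for rel in relations:
        entity = facts.get((entity, rel))
        if entity is None:
            return None  # evidence acquisition failed at this hop
    return entity


def bind_to_pixels(identity: str, candidates: List[Candidate]) -> Optional[Box]:
    """Return the box of the visible instance matching the resolved identity."""
    for cand in candidates:
        if cand.name == identity:
            return cand.box
    return None  # visual instance binding failed


def search_to_pixel(start: str, relations: List[str],
                    candidates: List[Candidate],
                    facts: Dict[Tuple[str, str], str] = TOY_FACTS) -> Optional[Box]:
    """Resolve the identity first, then ground it in the image."""
    identity = resolve_identity(start, relations, facts)
    if identity is None:
        return None
    return bind_to_pixels(identity, candidates)


candidates = [
    Candidate("Player B", (0, 0, 10, 10)),
    Candidate("Player A", (20, 5, 60, 90)),
]
box = search_to_pixel("Acme Cup 2024", ["winner", "captain"], candidates)
```

The two `None` return paths loosely mirror two of the failure sources named in the abstract, evidence acquisition and visual instance binding. A real system would replace the dictionary lookup with retrieval and the exact string match with a visual matching step.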