🤖 AI Summary
This work addresses the limitations of existing visual question answering methods in fine-grained perception and external knowledge integration, particularly their lack of an endogenous mechanism for deciding when and how to retrieve information. The authors propose PixSearch, the first end-to-end region-aware large multimodal model, which dynamically generates <search> tokens during encoding to autonomously select text, images, or pixel-level regions as queries and directly outputs masks for visual retrieval, thereby unifying perception and reasoning. By integrating pixel-level segmentation with retrieval-augmented reasoning end to end, PixSearch eliminates conventional modular pipelines. It employs a two-stage fine-tuning strategy with interleaved retrieval supervision. Experiments show that PixSearch improves accuracy by 19.7% (relative) over full-image retrieval on benchmarks such as CRAG-MM while maintaining strong reasoning performance across diverse VQA and text-only QA tasks.
📝 Abstract
Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that serve directly as visual queries, eliminating reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole-image retrieval, while retaining competitive reasoning performance on diverse VQA and text-only QA tasks.
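The abstract describes a decode loop in which the model itself decides when to retrieve: it emits a `<search>` token, picks a query modality (text, image, or a pixel-level region mask), and folds the retrieved evidence back into its context. A minimal sketch of such a loop follows; every interface here (`model.next_token`, `model.build_query`, `retriever.search`, the `SearchQuery` type) is invented for illustration and is not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

SEARCH_TOKEN = "<search>"  # special retrieval-trigger token, per the paper's description


@dataclass
class SearchQuery:
    modality: str  # "text", "image", or "region" (hypothetical labels)
    payload: object  # query string, image tensor, or pixel mask


def decode_with_retrieval(model, retriever, prompt: str, max_steps: int = 32) -> str:
    """Generate an answer, pausing to retrieve whenever the model
    emits a <search> token. Interfaces are illustrative stubs."""
    context: List[str] = [prompt]
    answer: List[str] = []
    for _ in range(max_steps):
        token = model.next_token(context)
        if token == SEARCH_TOKEN:
            # The model chooses the query modality and, for region queries,
            # produces a pixel-level mask to use as the visual query.
            query: SearchQuery = model.build_query(context)
            docs = retriever.search(query)  # external knowledge source
            context.extend(docs)  # interleave retrieved evidence
        elif token == "<eos>":
            break
        else:
            answer.append(token)
            context.append(token)
    return " ".join(answer)
```

The key design point the abstract emphasizes is that retrieval timing and query construction live inside the model's own decoding policy, rather than in an external pipeline of detectors, segmenters, and captioners.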