Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

📅 2025-02-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses the challenges of modality heterogeneity and inconsistent representations in multimodal information retrieval—spanning text, images, tables, and charts—by proposing a novel paradigm: Visualized Information Retrieval (Vis-IR). Vis-IR unifies diverse content types into semantically faithful screenshots, enabling cross-modal unified retrieval. Key contributions include: (1) the first formal definition of the Vis-IR paradigm; (2) VIRA, the first large-scale annotated screenshot dataset comprising 1.2 million high-quality samples; (3) UniSE, a general-purpose screenshot embedding model that achieves cross-modal semantic alignment via screenshot-level unified embedding and multi-stage vision–language contrastive training; and (4) MVRB, a comprehensive benchmark for multimodal visual retrieval. Experiments demonstrate that UniSE significantly outperforms state-of-the-art multimodal retrievers on MVRB, achieving an average +8.7% improvement in Recall@10 across tasks including image–text cross-retrieval and chart question answering, thereby validating the effectiveness and generalizability of the Vis-IR paradigm.
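
The multi-stage vision–language contrastive training credited to UniSE above is, in its standard form, an in-batch InfoNCE objective over (screenshot, query) pairs. The sketch below shows such an objective under that assumption; the function name, tensor shapes, and temperature are illustrative choices, not details from the UniSE release.

```python
# Minimal sketch of screenshot-text contrastive training (InfoNCE),
# in the spirit of UniSE's vision-language alignment. All names and
# hyperparameters here are hypothetical, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce_loss(screenshot_emb: torch.Tensor,
                  query_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over (screenshot, query) pairs.

    Each query's positive is the screenshot at the same batch index;
    every other screenshot in the batch serves as a negative.
    """
    s = F.normalize(screenshot_emb, dim=-1)   # [batch, dim]
    q = F.normalize(query_emb, dim=-1)        # [batch, dim]
    logits = q @ s.t() / temperature          # [batch, batch] similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are positives

# Toy usage with random tensors standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 32), torch.randn(8, 32))
```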

📝 Abstract
With the popularity of multimodal techniques, there is growing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables, and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our work will be shared with the community, laying a solid foundation for this emerging field.
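
To make the retrieval setting concrete: screenshots are embedded once into a corpus, queries from any modality are embedded into the same space, and ranking quality is scored with metrics such as the Recall@10 figure quoted in the summary above. Below is a minimal sketch of that evaluation loop; the toy embeddings and the recall_at_k helper are assumptions for illustration, not code or data from VIRA or MVRB.

```python
# Hedged sketch of the Vis-IR retrieval flow and a Recall@k metric.
# Random vectors stand in for the outputs of a UniSE-style encoder.
import numpy as np

def recall_at_k(scores: np.ndarray, gold: np.ndarray, k: int = 10) -> float:
    """scores: [n_queries, n_docs] similarity matrix.
    gold: [n_queries] index of each query's relevant screenshot."""
    topk = np.argsort(-scores, axis=1)[:, :k]      # highest-scoring k docs
    hits = (topk == gold[:, None]).any(axis=1)     # was the gold doc retrieved?
    return float(hits.mean())

# Toy usage: 4 queries against a corpus of 100 screenshot embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = corpus[[3, 17, 42, 99]] + 0.1 * rng.normal(size=(4, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(recall_at_k(queries @ corpus.T, np.array([3, 17, 42, 99])))
```
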
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal information retrieval
Developing universal screenshot embeddings
Creating a comprehensive visualized IR benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual format (see the rendering sketch after this list)
Large-scale dataset creation
Universal embedding models
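
The "unified visual format" innovation amounts to rendering heterogeneous content into one image so a single vision encoder can embed everything. The sketch below illustrates that idea with Pillow; it is a toy rendering under assumed layout choices, not the paper's actual pipeline.

```python
# Illustrative sketch of the unified visual format: render free text
# plus a small table into one screenshot image, which is then what
# gets embedded (rather than the raw text or table structure).
from PIL import Image, ImageDraw

def render_screenshot(text: str, table: list[list[str]],
                      size=(640, 360)) -> Image.Image:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((20, 20), text, fill="black")        # the textual content
    y = 80
    for row in table:                              # table as aligned rows
        draw.text((20, y), " | ".join(row), fill="black")
        y += 24
    return img

shot = render_screenshot(
    "Quarterly revenue summary",
    [["Q1", "$1.2M"], ["Q2", "$1.5M"]],
)
shot.save("screenshot.png")  # this image, not the raw text, gets embedded
```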