VIRTUE: Visual-Interactive Text-Image Universal Embedder

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal embedding models lack visual interaction capabilities, hindering fine-grained representation learning conditioned on user-specified visual prompts—such as points, bounding boxes, or masks—and thus limiting localized intent grounding and entity-level cross-modal understanding. To address this, we propose VIRTUE, the first vision-interactive multimodal embedding model, which unifies segmentation priors with vision-language modeling to achieve precise alignment between textual queries and image regions. We further introduce SCaR, a novel benchmark explicitly designed to evaluate vision-prompt-driven semantic localization—a capability not systematically assessed before. Experiments demonstrate that VIRTUE achieves average improvements of 3.1–8.5% across 36 general-purpose MMEB tasks and outperforms state-of-the-art methods by 15.2–20.3% on all 5 SCaR tasks. These results confirm its superior capacity for entity-level representation learning and interactive cross-modal retrieval.
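To make the interaction concrete, the sketch below shows what a visual-prompt-conditioned embedder interface could look like: an image, a user-specified bounding box, and candidate captions are mapped into a shared space and compared by cosine similarity. All class and method names (VisualInteractiveEmbedder, embed_query, embed_text), feature dimensions, and the fusion by simple addition are hypothetical placeholders, not VIRTUE's actual API or architecture.

```python
# Minimal sketch of a visual-interactive embedding interface (hypothetical API,
# not VIRTUE's actual code): queries are conditioned on a visual prompt.
import torch
import torch.nn.functional as F

class VisualInteractiveEmbedder(torch.nn.Module):
    """Toy stand-in mapping (image, visual prompt) and text to a shared space."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Placeholder encoders; a real system would use a segmentation prompt
        # encoder plus a vision-language backbone.
        self.image_proj = torch.nn.Linear(1024, dim)
        self.prompt_proj = torch.nn.Linear(4, dim)   # e.g., a box (x1, y1, x2, y2)
        self.text_proj = torch.nn.Linear(768, dim)

    def embed_query(self, image_feat, box):
        # Fuse global image features with the region given by the visual prompt.
        fused = self.image_proj(image_feat) + self.prompt_proj(box)
        return F.normalize(fused, dim=-1)

    def embed_text(self, text_feat):
        return F.normalize(self.text_proj(text_feat), dim=-1)

# Retrieval reduces to cosine similarity between query and candidate embeddings.
model = VisualInteractiveEmbedder()
q = model.embed_query(torch.randn(1, 1024), torch.tensor([[0.1, 0.2, 0.5, 0.6]]))
cands = model.embed_text(torch.randn(8, 768))
scores = q @ cands.T              # shape (1, 8); highest score = retrieved caption
print(scores.argmax(dim=-1))
```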

📝 Abstract
Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
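As a rough illustration of the SCaR evaluation described above, the snippet below computes Recall@1 for caption retrieval from precomputed query and caption embeddings: each region-conditioned query is matched against the full caption pool, and a hit is counted when its own caption (the one describing both the prompted entity and the surrounding scene) ranks first. The embeddings are random placeholders and the function name is our own; only the metric logic is shown.

```python
# Toy Recall@1 in the spirit of SCaR-style retrieval; embeddings are random
# placeholders, only the ranking/metric logic is real.
import torch
import torch.nn.functional as F

def recall_at_1(query_emb: torch.Tensor, caption_emb: torch.Tensor) -> float:
    """query_emb: (N, D) region-conditioned query embeddings.
    caption_emb: (N, D) ground-truth caption embeddings; caption i is the
    positive for query i, and all N captions form the candidate pool."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    sims = q @ c.T                          # (N, N) cosine similarity matrix
    top1 = sims.argmax(dim=-1)              # best-scoring caption per query
    correct = (top1 == torch.arange(len(q))).float()
    return correct.mean().item()

print(recall_at_1(torch.randn(100, 512), torch.randn(100, 512)))
```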
Problem

Research questions and friction points this paper is trying to address.

Existing embedding models cannot accept visual prompts (points, bounding boxes, masks) that specify a user's region of interest
Segmentation models and vision-language models have not been jointly extended to representation learning
Entity-level, visually grounded retrieval remains unexplored and unsupported by global-only image representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends segmentation model for visual prompts
Integrates vision-language model into representation learning
Enables region-specific embedding via interactive inputs
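One plausible way to realize that last point is to fuse segmentation-derived region tokens with the VLM's visual tokens before pooling them into a single retrieval embedding. The sketch below uses simple concatenation and mean pooling; this fusion scheme is our assumption for illustration, not the paper's actual design, and all names are hypothetical.

```python
# Sketch of region-token fusion (an assumed design, not VIRTUE's architecture):
# tokens from a segmentation prompt encoder are concatenated with the VLM's
# image tokens, then pooled into one embedding used for retrieval.
import torch
import torch.nn.functional as F

def fuse_and_pool(image_tokens: torch.Tensor,   # (N_img, D) VLM visual tokens
                  region_tokens: torch.Tensor,  # (N_reg, D) prompt-encoder tokens
                  proj: torch.nn.Linear) -> torch.Tensor:
    tokens = torch.cat([image_tokens, region_tokens], dim=0)  # (N_img + N_reg, D)
    pooled = tokens.mean(dim=0)                               # simple mean pooling
    return F.normalize(proj(pooled), dim=-1)                  # final embedding

proj = torch.nn.Linear(1024, 512)
emb = fuse_and_pool(torch.randn(256, 1024), torch.randn(16, 1024), proj)
print(emb.shape)  # torch.Size([512])
```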