UniCoRN: Unified Commented Retrieval Network with LMMs

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods suffer from a fundamental disconnect between retrieval and generation on complex compositional visual queries: multimodal retrieval lacks fine-grained semantic reasoning and natural language explanation capability, while large multimodal models (LMMs) exhibit strong generative capacity but cannot autonomously retrieve supporting entities. To bridge this gap, we propose UniCoRN (Unified Commented Retrieval Network). Our contributions are threefold: (1) We formally define and benchmark the novel "Commented Retrieval" (CoR) task: jointly retrieving relevant images and generating explanatory natural language comments; (2) We design an entity adapter module that injects retrieved multimodal entities back into a frozen LMM, enabling synchronized image matching and comment generation; (3) By keeping the base LMM frozen, we preserve its original capabilities while supporting both retrieval and text generation in a single integrated framework. Experiments show UniCoRN improves composed multimodal retrieval recall by +4.5% over the state of the art, and achieves +14.9% METEOR and +18.4% BEM gains on Commented Retrieval, substantially outperforming RAG-based baselines.

📝 Abstract
Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer more complex visual questions in language, but lack the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets, showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
Problem

Research questions and friction points this paper is trying to address.

Handling complex, compositional multimodal retrieval queries
Integrating retrieval with generative language models
Improving accuracy and detail in visual responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Commented Retrieval Network
Entity adapter module
Frozen Large Multimodal Models
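The entity adapter named above is described in the abstract as a module that projects retrieved multimodal entities back into the frozen LMM so the model can attend to them while generating comments. A minimal sketch of that idea, assuming a trainable two-layer projection and hypothetical dimensions (`retriever_dim`, `lmm_dim`, and all module names are illustrative, not the paper's actual implementation):

```python
import torch
import torch.nn as nn

class EntityAdapter(nn.Module):
    """Hypothetical sketch: map retriever-space entity embeddings into the
    frozen LMM's token-embedding space so the LMM can attend to them."""

    def __init__(self, retriever_dim: int, lmm_dim: int):
        super().__init__()
        # Only this small projection is trained; the base LMM stays frozen.
        self.proj = nn.Sequential(
            nn.Linear(retriever_dim, lmm_dim),
            nn.GELU(),
            nn.Linear(lmm_dim, lmm_dim),
        )

    def forward(self, entity_embs: torch.Tensor) -> torch.Tensor:
        # entity_embs: (batch, num_entities, retriever_dim)
        return self.proj(entity_embs)  # (batch, num_entities, lmm_dim)

# Illustrative usage: prepend projected entity tokens to the embedded
# query tokens before they enter the frozen LMM's transformer layers.
adapter = EntityAdapter(retriever_dim=512, lmm_dim=4096)
entities = torch.randn(2, 3, 512)        # 3 retrieved entities per query
entity_tokens = adapter(entities)        # (2, 3, 4096)
query_tokens = torch.randn(2, 16, 4096)  # embedded query tokens
lmm_input = torch.cat([entity_tokens, query_tokens], dim=1)
```

Because the adapter output lives in the same space as ordinary token embeddings, the frozen LMM can attend to retrieved entities without any change to its own weights, which is what lets a single framework handle both retrieval and comment generation.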