🤖 AI Summary
Existing retrieval models often assume homogeneous knowledge sources, rendering them inadequate for real-world scenarios involving heterogeneous modalities (e.g., text, tables, code) and diverse user instructions. To address this, we propose UniHGKR, an instruction-aware heterogeneous retrieval framework that constructs a unified embedding space for heterogeneous knowledge. UniHGKR employs a three-stage training paradigm: heterogeneous self-supervised pretraining, text-anchored cross-modal embedding alignment, and instruction-aware fine-tuning, and supports both BERT-based and LLM-based instantiations (e.g., UniHGKR-7B). We further introduce CompMix-IR, the first native benchmark for heterogeneous knowledge retrieval, comprising diverse modalities and instruction-driven queries. On CompMix-IR, UniHGKR outperforms strong baselines by up to 6.36% and 54.23% (relative) in its two retrieval scenarios. Moreover, on the ConvMix open-domain QA task, UniHGKR attains a new state-of-the-art result, with an absolute gain of up to 5.90 points.
📝 Abstract
Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. The framework is highly scalable, with a BERT-based version and an LLM-based UniHGKR-7B version. We also introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries covering four types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in the two scenarios, respectively. Finally, by equipping open-domain heterogeneous QA systems with our retriever, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 5.90 points.
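The instruction-following retrieval setup described above can be sketched as a minimal toy: an instruction stating what kind of knowledge to retrieve is prepended to the query, both are mapped into a single embedding space shared with a heterogeneous corpus, and entries are ranked by cosine similarity. This is a hypothetical illustration of the interface only, not the paper's implementation; the `embed` function below is a hashed-trigram stand-in for the trained BERT- or LLM-based retriever, and the `[table]`/`[text]`/`[kb]` markers are assumed corpus conventions.

```python
# Toy sketch of instruction-aware retrieval over a heterogeneous corpus.
# NOT the UniHGKR model: `embed` is a stand-in encoder for illustration.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in encoder: hashed character trigrams -> L2-normalized vector.
    A real system would call the trained unified retriever here."""
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        v[int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(instruction: str, query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Embed the instruction-prefixed query, rank corpus entries by cosine."""
    q = embed(f"{instruction} {query}")
    scores = [float(q @ embed(doc)) for doc in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

# Heterogeneous corpus entries (type markers are illustrative assumptions).
corpus = [
    "[table] country | population | Germany | 83M",
    "[text] Germany is a country in Central Europe.",
    "[kb] (Germany, capital, Berlin)",
]
hits = retrieve("Retrieve evidence of any type for:", "population of Germany", corpus)
print(hits)
```

Changing only the instruction string (e.g., "Retrieve a table that answers:") is what would steer retrieval toward a specified knowledge type, since instruction and query are encoded jointly.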