🤖 AI Summary
Existing retrieval models often assume homogeneous knowledge sources, rendering them inadequate for real-world scenarios involving heterogeneous modalities (e.g., text, tables, code) and diverse user instructions. To address this, we propose UniHGKR, an instruction-aware heterogeneous retrieval framework that constructs a unified embedding space for heterogeneous knowledge. UniHGKR employs a three-stage training paradigm: heterogeneous self-supervised pretraining, text-anchored cross-modal embedding alignment, and instruction-aware fine-tuning, and supports both BERT-based and LLM-based instantiations (e.g., UniHGKR-7B). We further introduce CompMix-IR, the first native benchmark for heterogeneous knowledge retrieval, comprising diverse modalities and instruction-driven queries. On CompMix-IR, UniHGKR outperforms strong baselines by up to 6.36% and 54.23% (relative) in its two retrieval scenarios. Moreover, on the ConvMix open-domain QA task, UniHGKR attains a new state-of-the-art result, with an absolute gain of up to 5.90 points.
📝 Abstract
Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. The framework is highly scalable, with a BERT-based version and an LLM-based UniHGKR-7B version. We also introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries covering four types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in the two scenarios, respectively. Finally, by equipping open-domain heterogeneous QA systems with our retriever, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 5.90 points.
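The instruction-following retrieval setup described above can be sketched as a minimal toy: an instruction stating what kind of knowledge to retrieve is prepended to the query, both are mapped into a single embedding space shared with a heterogeneous corpus, and entries are ranked by cosine similarity. This is a hypothetical illustration of the interface only, not the paper's implementation; the `embed` function below is a hashed-trigram stand-in for the trained BERT- or LLM-based retriever, and the `[table]`/`[text]`/`[kb]` markers are assumed corpus conventions.

```python
# Toy sketch of instruction-aware retrieval over a heterogeneous corpus.
# NOT the UniHGKR model: `embed` is a stand-in encoder for illustration.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in encoder: hashed character trigrams -> L2-normalized vector.
    A real system would call the trained unified retriever here."""
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        v[int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(instruction: str, query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Embed the instruction-prefixed query, rank corpus entries by cosine."""
    q = embed(f"{instruction} {query}")
    scores = [float(q @ embed(doc)) for doc in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

# Heterogeneous corpus entries (type markers are illustrative assumptions).
corpus = [
    "[table] country | population | Germany | 83M",
    "[text] Germany is a country in Central Europe.",
    "[kb] (Germany, capital, Berlin)",
]
hits = retrieve("Retrieve evidence of any type for:", "population of Germany", corpus)
print(hits)
```

Changing only the instruction string (e.g., "Retrieve a table that answers:") is what would steer retrieval toward a specified knowledge type, since instruction and query are encoded jointly.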