🤖 AI Summary
Generative retrieval excels in zero-shot settings but struggles to generalize to unseen documents at inference time. To address this, we propose the Hierarchical Corpus Encoder (HCE), the first approach to explicitly incorporate document-level hierarchical structure—particularly sibling relationships—into contrastive learning. HCE jointly optimizes generative retrieval and dense indexing via three synergistic components: (1) a hierarchical contrastive loss leveraging structural priors, (2) generative modeling of document IDs, and (3) fusion of dense vector representations. Crucially, HCE supports dynamic corpus updates (i.e., insertion and deletion of documents) and operates effectively in both zero-shot and supervised regimes. Extensive experiments demonstrate that HCE consistently outperforms state-of-the-art baselines—including DSI and NCI—across unsupervised zero-shot and supervised retrieval tasks. It preserves the flexibility and scalability inherent to dense indexing while substantially improving generalization to previously unseen documents.
📝 Abstract
Generative retrieval employs sequence models to generate document IDs conditioned on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this approach has improved zero-shot retrieval performance, supporting documents not seen during training remains a challenge. We identify that the strong performance of generative retrieval stems from contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE outperforms generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing documents to be easily added to and removed from the index.
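To make the core idea concrete, below is a minimal sketch of a hierarchical contrastive loss in the spirit the abstract describes: at each level of a cluster tree over the corpus, the query is pulled toward the centroid of the subtree containing the target document and pushed away from the centroids of its sibling subtrees. This is an illustrative formulation, not the paper's exact objective; the tree layout, centroid positives, and InfoNCE form are assumptions for the sketch.

```python
import numpy as np

def subtree_leaves(tree, node):
    """Collect the leaf document IDs under `node` in the cluster tree."""
    if node not in tree:                  # any node without children is a leaf
        return [node]
    leaves = []
    for child in tree[node]:
        leaves.extend(subtree_leaves(tree, child))
    return leaves

def centroid(doc_vecs, doc_ids):
    """Mean embedding of a set of documents."""
    return np.mean([doc_vecs[d] for d in doc_ids], axis=0)

def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE over cosine similarities."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0] + 1e-12)

def hierarchical_contrastive_loss(query_vec, doc_id, tree, doc_vecs,
                                  root="root", tau=0.1):
    """Walk the root-to-leaf path of `doc_id`; at each internal node, the
    child subtree containing the document supplies the positive centroid
    and its sibling subtrees supply the negatives (assumed formulation)."""
    loss, node = 0.0, root
    while node in tree:
        children = tree[node]
        pos_child = next(c for c in children
                         if doc_id in subtree_leaves(tree, c))
        pos = centroid(doc_vecs, subtree_leaves(tree, pos_child))
        negs = [centroid(doc_vecs, subtree_leaves(tree, c))
                for c in children if c != pos_child]
        if negs:                          # a single-child node adds no contrast
            loss += info_nce(query_vec, pos, negs, tau)
        node = pos_child                  # descend one level
    return loss

# Toy example: a two-level tree over four documents.
tree = {"root": ["c0", "c1"], "c0": ["d0", "d1"], "c1": ["d2", "d3"]}
doc_vecs = {
    "d0": np.array([1.0, 0.0, 0.0, 0.0]),
    "d1": np.array([0.9, 0.1, 0.0, 0.0]),
    "d2": np.array([0.0, 0.0, 1.0, 0.0]),
    "d3": np.array([0.0, 0.0, 0.9, 0.1]),
}
loss_near = hierarchical_contrastive_loss(doc_vecs["d0"], "d0", tree, doc_vecs)
loss_far  = hierarchical_contrastive_loss(doc_vecs["d2"], "d0", tree, doc_vecs)
```

A query embedded near its target document (`loss_near`) incurs a much smaller loss than one embedded near a sibling subtree (`loss_far`), which is exactly the sibling-discrimination signal the abstract attributes to generative retrieval. Because the loss is defined over plain dense embeddings, it can be optimized with any dual-encoder, which is what lets documents be added or removed without retraining a generative ID decoder.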