Hierarchical corpus encoder: Fusing generative retrieval and dense indices

📅 2025-02-26
🤖 AI Summary
Generative retrieval excels in zero-shot settings but struggles to generalize to unseen documents at inference time. To address this, we propose the Hierarchical Corpus Encoder (HCE), the first approach to explicitly incorporate document-level hierarchical structure—particularly sibling relationships—into contrastive learning. HCE jointly optimizes generative retrieval and dense indexing via three synergistic components: (1) a hierarchical contrastive loss leveraging structural priors, (2) generative modeling of document IDs, and (3) fusion of dense vector representations. Crucially, HCE supports dynamic corpus updates (i.e., insertion and deletion of documents) and operates effectively in both zero-shot and supervised regimes. Extensive experiments demonstrate that HCE consistently outperforms state-of-the-art baselines—including DSI and NCI—across unsupervised zero-shot and supervised retrieval tasks. It preserves the flexibility and scalability inherent to dense indexing while substantially improving generalization to previously unseen documents.

📝 Abstract
Generative retrieval employs sequence models to generate document IDs conditioned on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, supporting documents not seen during training remains a challenge. We identify that the performance of generative retrieval stems from contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE achieves better results than generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing documents to be added to and removed from the index easily.
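
To make "contrastive training between sibling nodes" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy two-level hierarchy, hand-written 2-d vectors for every node, dot-product scoring, and a per-level softmax over the children of the current parent; the names `sibling_contrastive_loss`, `CHILDREN`, and `NODE_VECS` are hypothetical.

```python
import math

def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sibling_contrastive_loss(query, path, children, node_vecs, temperature=1.0):
    """Sum of per-level cross-entropy terms along a root-to-document path.

    At each level the correct node is the positive and its siblings
    (the other children of the same parent) are the negatives.
    This formulation is an assumption made for illustration.
    """
    loss = 0.0
    parent = "root"
    for node in path:
        siblings = children[parent]  # includes the correct node itself
        logits = [dot(query, node_vecs[s]) / temperature for s in siblings]
        # Numerically stable log-sum-exp over the sibling set.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[siblings.index(node)]
        parent = node
    return loss

# Toy hierarchy (hypothetical): two clusters, three documents.
CHILDREN = {"root": ["A", "B"], "A": ["d1", "d2"], "B": ["d3"]}
NODE_VECS = {
    "A": [1.0, 0.0], "B": [0.0, 1.0],
    "d1": [1.0, 0.2], "d2": [0.8, -0.5], "d3": [0.0, 1.0],
}
QUERY = [1.0, 0.1]

loss_relevant = sibling_contrastive_loss(QUERY, ["A", "d1"], CHILDREN, NODE_VECS)
loss_irrelevant = sibling_contrastive_loss(QUERY, ["B", "d3"], CHILDREN, NODE_VECS)
print(loss_relevant, loss_irrelevant)
```

In this toy setup the path to the document that matches the query receives the lower loss. A level with a single child contributes zero, since there are no siblings to contrast against.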
Problem

Research questions and friction points this paper is trying to address.

Improves zero-shot retrieval performance
Supports documents not seen during training
Enhances document index flexibility
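
The index flexibility point can be illustrated with a small sketch of why a dense index makes insertion and deletion cheap: each document is just a stored vector, so changing the corpus does not require retraining a model that has memorized document IDs. This is a brute-force toy under stated assumptions (exact dot-product search, hand-written vectors in place of encoder outputs); the class `DenseIndex` and its methods are hypothetical and do not come from the paper.

```python
class DenseIndex:
    """Toy exact-search dense index keyed by document ID."""

    def __init__(self):
        self._vecs = {}

    def add(self, doc_id, vec):
        # Inserting a document only stores its vector.
        self._vecs[doc_id] = list(vec)

    def remove(self, doc_id):
        # Deleting a document only drops its vector.
        self._vecs.pop(doc_id, None)

    def search(self, query, k=1):
        # Rank all stored documents by dot product with the query.
        def score(item):
            return sum(q * x for q, x in zip(query, item[1]))
        ranked = sorted(self._vecs.items(), key=score, reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

index = DenseIndex()
index.add("d1", [1.0, 0.0])
index.add("d2", [0.0, 1.0])
query = [0.9, 0.1]

before = index.search(query)         # only d1 and d2 are indexed
index.add("d3", [2.0, 0.0])          # a document added after "training"
with_new = index.search(query)
index.remove("d3")                   # and removed again
after_removal = index.search(query)
print(before, with_new, after_removal)
```

A production system would typically replace the linear scan with an approximate nearest-neighbor library, but the add and remove operations keep the same shape.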
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical corpus encoder fusion
Generative retrieval with dense indices
Contrastive training in document hierarchy