🤖 AI Summary
Generative retrieval excels in zero-shot settings but struggles to generalize to unseen documents at inference time. To address this, we propose the Hierarchical Corpus Encoder (HCE), the first approach to explicitly incorporate document-level hierarchical structure—particularly sibling relationships—into contrastive learning. HCE jointly optimizes generative retrieval and dense indexing via three synergistic components: (1) a hierarchical contrastive loss leveraging structural priors, (2) generative modeling of document IDs, and (3) fusion of dense vector representations. Crucially, HCE supports dynamic corpus updates (i.e., insertion and deletion of documents) and operates effectively in both zero-shot and supervised regimes. Extensive experiments demonstrate that HCE consistently outperforms state-of-the-art baselines—including DSI and NCI—across unsupervised zero-shot and supervised retrieval tasks. It preserves the flexibility and scalability inherent to dense indexing while substantially improving generalization to previously unseen documents.
📝 Abstract
Generative retrieval employs sequence models to generate document IDs conditioned on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this approach has improved zero-shot retrieval performance, supporting documents not seen during training remains a challenge. We identify that the strong performance of generative retrieval stems from contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE outperforms generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing documents to be easily added to and removed from the index.
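To make the core idea concrete, below is a minimal sketch of a hierarchical contrastive loss in the spirit the abstract describes: at each level of a cluster tree over the corpus, the query is pulled toward the centroid of the subtree containing the target document and pushed away from the centroids of its sibling subtrees. This is an illustrative formulation, not the paper's exact objective; the tree layout, centroid positives, and InfoNCE form are assumptions for the sketch.

```python
import numpy as np

def subtree_leaves(tree, node):
    """Collect the leaf document IDs under `node` in the cluster tree."""
    if node not in tree:                  # any node without children is a leaf
        return [node]
    leaves = []
    for child in tree[node]:
        leaves.extend(subtree_leaves(tree, child))
    return leaves

def centroid(doc_vecs, doc_ids):
    """Mean embedding of a set of documents."""
    return np.mean([doc_vecs[d] for d in doc_ids], axis=0)

def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE over cosine similarities."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0] + 1e-12)

def hierarchical_contrastive_loss(query_vec, doc_id, tree, doc_vecs,
                                  root="root", tau=0.1):
    """Walk the root-to-leaf path of `doc_id`; at each internal node, the
    child subtree containing the document supplies the positive centroid
    and its sibling subtrees supply the negatives (assumed formulation)."""
    loss, node = 0.0, root
    while node in tree:
        children = tree[node]
        pos_child = next(c for c in children
                         if doc_id in subtree_leaves(tree, c))
        pos = centroid(doc_vecs, subtree_leaves(tree, pos_child))
        negs = [centroid(doc_vecs, subtree_leaves(tree, c))
                for c in children if c != pos_child]
        if negs:                          # a single-child node adds no contrast
            loss += info_nce(query_vec, pos, negs, tau)
        node = pos_child                  # descend one level
    return loss

# Toy example: a two-level tree over four documents.
tree = {"root": ["c0", "c1"], "c0": ["d0", "d1"], "c1": ["d2", "d3"]}
doc_vecs = {
    "d0": np.array([1.0, 0.0, 0.0, 0.0]),
    "d1": np.array([0.9, 0.1, 0.0, 0.0]),
    "d2": np.array([0.0, 0.0, 1.0, 0.0]),
    "d3": np.array([0.0, 0.0, 0.9, 0.1]),
}
loss_near = hierarchical_contrastive_loss(doc_vecs["d0"], "d0", tree, doc_vecs)
loss_far  = hierarchical_contrastive_loss(doc_vecs["d2"], "d0", tree, doc_vecs)
```

A query embedded near its target document (`loss_near`) incurs a much smaller loss than one embedded near a sibling subtree (`loss_far`), which is exactly the sibling-discrimination signal the abstract attributes to generative retrieval. Because the loss is defined over plain dense embeddings, it can be optimized with any dual-encoder, which is what lets documents be added or removed without retraining a generative ID decoder.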