C2T-ID: Converting Semantic Codebooks to Textual Document Identifiers for Generative Search

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generative retrieval, document identifiers (docids) must balance semantic expressiveness against a controllable search space: numeric codebooks lack semantics, while plain-text docids inflate the decoding vocabulary and suffer severe error propagation. This paper proposes a semantically enhanced textual codebook method that combines hierarchical clustering with iterative keyword substitution, optionally followed by a two-level semantic smoothing step. The approach keeps the search space compact while improving docid fluency and semantic expressiveness. On Natural Questions and Taobao product search, the method consistently outperforms atomic-ID, semantic-codebook, and plain-text docid baselines, indicating that semantic expressivity and retrievability can be optimized jointly in generative retrieval.

📝 Abstract
Designing document identifiers (docids) that carry rich semantic information while maintaining tractable search spaces is an important challenge in generative retrieval (GR). Popular codebook methods address this by building a hierarchical semantic tree and constraining each generation step to the child nodes of the current node, yet their numeric identifiers cannot leverage the large language model's pretrained natural language understanding. Conversely, using text as the docid provides more semantic expressivity but inflates the decoding space, making the system brittle to early-step errors. To resolve this trade-off, we propose C2T-ID, which (i) constructs semantic numeric docids via hierarchical clustering; (ii) extracts high-frequency metadata keywords and iteratively replaces each numeric label with its cluster's top-K keywords; and (iii) optionally applies a two-level semantic smoothing step to further enhance docid fluency. Experiments on Natural Questions and Taobao product search show that C2T-ID significantly outperforms atomic, semantic codebook, and pure-text docid baselines, demonstrating its effectiveness in balancing semantic expressiveness with search space constraints.
Problem

Research questions and friction points this paper is trying to address.

Designing semantic document identifiers for generative search systems
Balancing semantic expressivity with tractable decoding space constraints
Converting numeric codebooks to textual identifiers using clustering
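
To make the "tractable decoding space" point concrete, here is a minimal Python sketch of prefix-constrained decoding over a docid tree. It is an illustration only, not the paper's implementation: the helper names `build_trie` and `allowed_next` and the toy docids are assumptions, and a real system would work on tokenizer IDs inside the decoder rather than on whole words.

```python
def build_trie(docids):
    """Build a nested-dict prefix tree from docids given as token tuples."""
    trie = {}
    for docid in docids:
        node = trie
        for tok in docid:
            node = node.setdefault(tok, {})
    return trie


def allowed_next(trie, prefix):
    """Return the tokens that may follow `prefix`; empty if the prefix is invalid."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:  # prefix left the tree, so nothing is allowed
            return []
    return sorted(node)


# Toy docids (hypothetical): one keyword per tree level.
docids = [("phone", "apple"), ("phone", "samsung"), ("shoes", "running")]
trie = build_trie(docids)

print(allowed_next(trie, ()))          # ['phone', 'shoes']
print(allowed_next(trie, ("phone",)))  # ['apple', 'samsung']
```

At each step the decoder only scores the children of the current node, so the branching factor is bounded by the tree rather than by the full vocabulary. Free-form text docids lose this bound, which is the brittleness to early-step errors that the abstract describes.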
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts hierarchical numeric docids to keywords
Uses cluster top-K keywords for semantic enrichment
Applies two-level smoothing for enhanced fluency
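
The first two contributions can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: it assumes the numeric docids from hierarchical clustering are already available as tuples of cluster labels, and the function name `codebook_to_text`, the frequency-based ranking, the alphabetical tie-break, and the rule that a node skips keywords already used by its ancestors or siblings are all assumptions made for the example. The smoothing step is omitted because the summary does not describe it.

```python
from collections import Counter, defaultdict


def codebook_to_text(numeric_ids, keywords, top_k=1):
    """Replace each numeric cluster label with its cluster's top-K keywords."""
    # Collect the documents that fall under every tree node (docid prefix).
    members = defaultdict(list)
    for doc, code in numeric_ids.items():
        for depth in range(1, len(code) + 1):
            members[code[:depth]].append(doc)

    node_words = {}            # tree node -> chosen keywords
    taken = defaultdict(set)   # parent node -> keywords used by its children
    # Visit shallow nodes first so children can see their ancestors' keywords.
    for prefix in sorted(members, key=lambda p: (len(p), p)):
        counts = Counter(kw for doc in members[prefix] for kw in keywords[doc])
        excluded = set(taken[prefix[:-1]])
        for depth in range(1, len(prefix)):
            excluded.update(node_words[prefix[:depth]])
        ranked = sorted((w for w in counts if w not in excluded),
                        key=lambda w: (-counts[w], w))
        # Assumption: fall back to the numeric label if no keyword is left.
        words = tuple(ranked[:top_k]) or (str(prefix[-1]),)
        node_words[prefix] = words
        taken[prefix[:-1]].update(words)

    return {
        doc: tuple(" ".join(node_words[code[:d]]) for d in range(1, len(code) + 1))
        for doc, code in numeric_ids.items()
    }


# Hypothetical two-level codebook and metadata keywords.
numeric_ids = {"d1": (0, 0), "d2": (0, 1), "d3": (1, 0), "d4": (1, 1)}
keywords = {
    "d1": ["phone", "apple"],
    "d2": ["phone", "samsung"],
    "d3": ["shoes", "running"],
    "d4": ["shoes", "hiking"],
}
print(codebook_to_text(numeric_ids, keywords))
# {'d1': ('phone', 'apple'), 'd2': ('phone', 'samsung'),
#  'd3': ('shoes', 'running'), 'd4': ('shoes', 'hiking')}
```

The tree shape is unchanged, so the search space stays as compact as the numeric codebook, while each level now carries a word the language model already understands.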