🤖 AI Summary
Autoregressive language models must generate multiple retrieval tokens sequentially, which makes multi-token retrieval slow, and prior multi-token variants did not reliably improve over single-token decoding. This work proposes DiffRetriever, a representative-token retriever for diffusion language models: it appends K masked positions to the input prompt and reads all K in a single bidirectional forward pass, yielding multiple representative tokens for dense and sparse retrieval. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token decoding on every diffusion backbone tested, while autoregressive multi-token decoding is flat or negative and incurs latency that grows with K. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in the authors' comparison, ahead of PromptReps, DiffEmbed, and RepLLaMA. A per-query oracle on the frozen base model also exceeds contrastive fine-tuning at the same fixed budget, which points to adaptive budget selection as future work.
📝 Abstract
PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding.
We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.
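The core mechanism in the abstract (append K masked positions, then read all K from one bidirectional forward pass) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_bidirectional_forward` is a hypothetical stand-in for a diffusion language model, `MASK_ID` and the toy sizes are made up, and the ReLU-plus-max pooling of the sparse weights is an assumed choice, since the abstract does not say how the K outputs are combined.

```python
import numpy as np

MASK_ID = 0      # hypothetical id of the mask token
VOCAB = 50       # toy vocabulary size
HIDDEN = 8       # toy hidden size
MAX_LEN = 64     # toy maximum sequence length

_rng = np.random.default_rng(0)
_EMB = _rng.normal(size=(VOCAB, HIDDEN))    # toy token embeddings
_POS = _rng.normal(size=(MAX_LEN, HIDDEN))  # toy position embeddings


def toy_bidirectional_forward(token_ids):
    """Toy stand-in for one forward pass of a diffusion LM.

    Uses a single full (non-causal) attention layer, so every position,
    including the trailing masks, can attend to every other position.
    Returns hidden states (T, HIDDEN) and vocabulary logits (T, VOCAB).
    """
    x = _EMB[token_ids] + _POS[: len(token_ids)]
    scores = x @ x.T / np.sqrt(HIDDEN)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    hidden = weights @ x
    logits = hidden @ _EMB.T
    return hidden, logits


def encode_with_masks(prompt_ids, k):
    """Append k >= 1 masked positions and read all k from one forward pass."""
    ids = np.concatenate([np.asarray(prompt_ids, dtype=int),
                          np.full(k, MASK_ID, dtype=int)])
    hidden, logits = toy_bidirectional_forward(ids)   # single pass for any k
    dense = hidden[-k:]                               # (k, HIDDEN) dense vectors
    # Assumed pooling: ReLU, then max over the k masked positions.
    sparse = np.maximum(logits[-k:], 0.0).max(axis=0)  # (VOCAB,) term weights
    return dense, sparse


dense, sparse = encode_with_masks([5, 7, 9], k=4)
print(dense.shape, sparse.shape)
```

In this sketch the number of forward passes stays at one whatever the value of `k`, whereas an autoregressive decoder would need one pass per generated token. That difference is the latency contrast the abstract describes.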