Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the problem of extracting term representations from neural dense retrievers for use with classical sparse scoring functions such as BM25, without additional sparse supervision or fine-tuning of the retriever. The approach trains a sparse autoencoder on top of a frozen dense retriever to disentangle a latent vocabulary whose collection statistics are approximately Zipfian, enabling plug-and-play sparse retrieval. The work indicates that dense retrieval models encode structured, lexical-style features that can be indexed directly. According to the abstract, the method matches or outperforms single-vector scoring from its own base model as well as comparable SPLADE variants, and substantially outperforms its base model on LIMIT, a task designed to expose the failures of single-vector retrieval.
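One way to make the "approximately Zipfian" claim concrete is to fit a line to log frequency against log rank and check whether the slope is near -1. The sketch below is illustrative only: the function name `zipf_slope` and the synthetic counts are assumptions made here, and the paper may measure Zipfian behaviour differently.

```python
import math

def zipf_slope(term_counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian signature. Hypothetical helper,
    not taken from the paper.
    """
    freqs = sorted((c for c in term_counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic collection frequencies following an exact power law f(r) = C / r.
ideal_counts = [1000.0 / rank for rank in range(1, 501)]
# A flat distribution, for contrast: every term equally frequent.
flat_counts = [5.0] * 100

print(zipf_slope(ideal_counts))
print(zipf_slope(flat_counts))
```

In practice, `term_counts` would be the collection frequencies of the latent terms produced by the sparse autoencoder. A slope far from -1, or a visibly curved log-log plot, would suggest the vocabulary is less Zipfian than a natural-language one.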

📝 Abstract

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval with no learned expansion objective and no sparse retrieval supervision, and can be readily applied to any dense retriever. Latent Terms matches or outperforms single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, and that other methods can be leveraged to exploit it.
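The pipeline described in the abstract can be sketched roughly as follows: encode each frozen-retriever vector with a sparse autoencoder, treat the active latent indices as term ids, and score with BM25. This is a minimal sketch under several assumptions that are not from the source: a top-k ReLU encoder, one vector per token, random weights standing in for a trained autoencoder, random vectors standing in for retriever outputs, and the Lucene-style BM25 variant. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def topk_latents(vec, enc_weights, k):
    """ReLU(W x), then keep the indices of the k largest positive activations."""
    acts = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in enc_weights]
    ranked = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    return [i for i in ranked[:k] if acts[i] > 0.0]

def to_latent_terms(token_vecs, enc_weights, k=2):
    """Turn a text (list of token vectors) into a bag of latent term ids."""
    bag = Counter()
    for vec in token_vecs:
        bag.update(topk_latents(vec, enc_weights, k))
    return bag

def bm25_score(query_bag, doc_bag, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Lucene-style BM25 over latent term ids; query term frequency is ignored."""
    doc_len = sum(doc_bag.values())
    score = 0.0
    for term in query_bag:
        tf = doc_bag.get(term, 0)
        if tf == 0:
            continue
        df = doc_freq[term]
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

# Toy data: random stand-ins for encoder weights and retriever token vectors.
random.seed(0)
dim, n_latents = 8, 32
enc = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_latents)]
docs = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)] for _ in range(4)]

doc_bags = [to_latent_terms(d, enc) for d in docs]
doc_freq = Counter(term for bag in doc_bags for term in bag)
avg_len = sum(sum(bag.values()) for bag in doc_bags) / len(doc_bags)

# The query reuses two token vectors from document 2, so its terms occur there.
query_bag = to_latent_terms(docs[2][:2], enc)
scores = [bm25_score(query_bag, bag, doc_freq, len(doc_bags), avg_len) for bag in doc_bags]
print(scores)
```

Because the latent ids behave like ordinary terms, the same bags could be written to any inverted index that supports BM25. This is the sense in which the vocabulary is "BM25-ready", assuming the trained autoencoder yields sufficiently sparse and stable activations.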

Problem

Research questions and friction points this paper is trying to address.

dense retrieval

sparse retrieval

latent vocabulary

BM25

neural retrievers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Terms

dense retrieval

sparse autoencoder