🤖 AI Summary
This study investigates the impact of using cased versus uncased backbone language models on retrieval performance in Learned Sparse Retrieval (LSR). Addressing concerns that the prevalence of cased-only pretrained models may degrade LSR effectiveness, we systematically evaluate multiple cased/uncased model pairs across several benchmark datasets and introduce text lowercasing as a preprocessing step. Our experiments reveal that, under default settings, cased models significantly underperform their uncased counterparts; however, this performance gap is entirely eliminated when inputs are lowercased. Token-level analysis further demonstrates that, under lowercasing, cased models scarcely leverage case information and behave nearly identically to uncased models. This work provides the first empirical elucidation of this phenomenon's underlying mechanism, offering both theoretical grounding and practical guidance for effectively deploying modern cased models in LSR systems.
📄 Abstract
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions and thereby reduce vocabulary mismatch. However, the most recent state-of-the-art language models are released only in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbones perform substantially worse by default than their uncased counterparts; however, this gap can be eliminated by lowercasing the text as a preprocessing step. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://github.com/lionisakis/Uncased-vs-cased-models-in-LSR
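To make the vocabulary-mismatch intuition concrete, here is a minimal toy sketch (not the paper's implementation, and using a hypothetical vocabulary fragment rather than any real model's) of why lowercasing inputs to a cased tokenizer collapses case variants onto the same vocabulary items:

```python
# Hypothetical fragment of a cased vocabulary; real cased models carry
# separate entries for case variants of the same surface word.
CASED_VOCAB = {"Retrieval", "retrieval", "Sparse", "sparse", "[UNK]"}

def tokenize(text: str, lowercase: bool = False) -> list[str]:
    """Toy whitespace 'tokenizer' over the cased vocabulary above.

    With lowercase=True, query and document text are folded before lookup,
    so cased vocabulary items are never activated -- mimicking how, under
    lowercasing, a cased backbone behaves effectively as an uncased one.
    """
    if lowercase:
        text = text.lower()
    return [tok if tok in CASED_VOCAB else "[UNK]" for tok in text.split()]

query, doc = "Sparse Retrieval", "sparse retrieval"

# Default: query and document land on disjoint cased tokens (a mismatch
# in a sparse inverted index, since the postings would not overlap).
print(tokenize(query), tokenize(doc))

# Lowercased preprocessing: both sides map to identical uncased tokens.
print(tokenize(query, lowercase=True), tokenize(doc, lowercase=True))
```

This is only an illustration of the mechanism; the paper's experiments apply lowercasing to the inputs of real cased backbone models trained for LSR.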