🤖 AI Summary
This study addresses "wacky weights" in learned sparse retrieval models such as SPLADE: expansion terms that appear semantically unrelated to the input, which limit interpretability and whose origins, prevalence, and contribution to retrieval effectiveness have not been studied systematically. The authors formally define wackiness based on the lexical utility of expansion terms and introduce a measure for comparing the prevalence of wacky tokens across models with varying vocabularies and sparsity levels. By reproducing SPLADE-v2 and training it with various loss functions, datasets, and backbone transformers, they find that larger vocabularies are associated with a higher prevalence of wacky tokens, while stricter sparsity regularizers are associated with lower prevalence. Wacky weights are used primarily for in-domain effectiveness rather than out-of-domain generalization. The work thus relates wacky weights to vocabulary size, sparsity regularization, and domain, with implications for the interpretability of sparse retrieval systems.
📝 Abstract
Learned sparse retrieval models such as SPLADE combine the effectiveness of neural architectures with the efficiency of inverted indices. As these models assign weights to terms from a fixed vocabulary, interpretability is often touted as one of their major benefits. However, the emergence of wacky weights, i.e., expansion terms that appear semantically unrelated to the input, limits interpretability. While prior research has anecdotally observed this phenomenon, there is no systematic understanding of the origins of these weights, their prevalence, or their contribution to retrieval effectiveness. In this paper, we reproduce SPLADE-v2 to systematically investigate wacky weights across the SPLADE family of models. We present a comprehensive dissection of wacky weights, providing a formal definition of wackiness based on the lexical utility of expansion terms. Furthermore, we introduce a novel measure to compare the prevalence of wacky tokens across models with varying vocabularies and sparsity levels. Beyond reproducing the original SPLADE-v2, we train it with various loss functions, datasets, and backbone transformers to isolate the factors contributing to wackiness. Our results show that larger vocabularies are associated with a higher prevalence of wacky tokens, while stricter sparsity regularizers are associated with lower prevalence. Finally, we find that wacky weights are used primarily for in-domain effectiveness rather than out-of-domain generalization.
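To make the notion of an expansion term concrete, the following is a minimal sketch of how a SPLADE-style model turns per-token vocabulary logits into one weight per vocabulary term (max pooling of log(1 + ReLU(x)), as commonly described for SPLADE-v2) and how terms absent from the input can then be listed. The function names, the toy vocabulary, and the hand-made logits are assumptions for illustration only. The sketch does not implement the paper's formal definition of wackiness, which the abstract does not spell out; it only surfaces the candidate expansion terms that such a definition would be applied to.

```python
import numpy as np

def splade_term_weights(logits):
    """Pool per-token vocabulary logits into one weight per vocabulary term.

    logits: array of shape (num_input_tokens, vocab_size), e.g. the outputs
    of a masked-language-model head. Applies log(1 + ReLU(x)) and then takes
    the maximum over input tokens (SPLADE-v2 style pooling).
    """
    saturated = np.log1p(np.maximum(logits, 0.0))
    return saturated.max(axis=0)

def expansion_terms(weights, input_ids, vocab):
    """Return (term, weight) pairs for non-zero terms absent from the input,
    sorted from heaviest to lightest. Hypothetical helper, not from the paper."""
    present = set(input_ids)
    pairs = [
        (vocab[j], float(w))
        for j, w in enumerate(weights)
        if w > 0 and j not in present
    ]
    return sorted(pairs, key=lambda pair: -pair[1])

# Toy, hand-made values for illustration only (not real model outputs).
vocab = ["cat", "dog", "feline", "##ing", "the"]
input_ids = [0, 4]  # the input text contains "cat" and "the"
logits = np.array([
    [2.0, -1.0, 1.5, 0.3, -0.5],  # logits produced at the "cat" position
    [0.5, -2.0, 0.2, 0.0, 1.0],   # logits produced at the "the" position
])

weights = splade_term_weights(logits)
expanded = expansion_terms(weights, input_ids, vocab)
for term, weight in expanded:
    print(f"{term}: {weight:.3f}")
```

In this toy input, "feline" is an expansion term a reader would likely consider related to "cat", while a subword such as "##ing" is the kind of term that might look unrelated. Deciding which expansion terms actually count as wacky requires the lexical-utility criterion defined in the paper.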