🤖 AI Summary
To address the challenge of identifying tail labels in eXtreme Multi-Label Text Classification (XMTC) caused by highly imbalanced (long-tailed) label distributions, this paper proposes a sparse-dense retriever fusion framework. Methodologically, it jointly leverages sparse retrievers (e.g., BM25), which excel at exact lexical matching, and fine-tuned dense retrievers (e.g., BERT), which capture semantic similarity, within a unified embedding space. Candidate labels are retrieved via approximate nearest neighbor search, and a rank-based fusion strategy adaptively weights the outputs of both retrievers. Crucially, the framework exploits their complementary strengths without requiring retraining. Experiments across multiple XMTC benchmark datasets demonstrate that the method consistently outperforms individual retriever baselines: it significantly improves recall for tail labels while preserving accuracy on head labels, yielding substantial gains in tail-label F1-score and overall Precision@K.
📝 Abstract
In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage **head labels** and infrequent, low-coverage **tail labels**, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of *sparse* and *dense* retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers, thereby producing a final ranking that improves the effectiveness across both head and tail labels.
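To make the rank-based fusion idea concrete, here is a minimal sketch using Reciprocal Rank Fusion (RRF), one standard rank-based combination method; the abstract does not specify the paper's exact fusion rule, so the function, the constant `k=60`, and the toy label rankings below are illustrative assumptions, not the authors' implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked label lists via Reciprocal Rank Fusion (RRF).

    rankings: ranked lists of label ids (best first), e.g. one from a
              sparse retriever (BM25) and one from a dense retriever.
    k: smoothing constant; 60 is the value commonly used for RRF.
    Each label's fused score is the sum of 1 / (k + rank) over all
    rankings in which it appears, so it needs no score calibration
    and no retraining of either retriever.
    """
    scores = {}
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] = scores.get(label, 0.0) + 1.0 / (k + rank)
    # Return labels sorted by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy example (hypothetical label ids): a tail label "L7" that only the
# sparse retriever ranks first still surfaces near the top after fusion.
sparse = ["L7", "L1", "L3"]   # e.g. BM25 ranking
dense = ["L1", "L2", "L7"]    # e.g. fine-tuned BERT ranking
fused = reciprocal_rank_fusion([sparse, dense])
```

Because RRF operates only on ranks, it sidesteps the incomparable score scales of BM25 and dense-embedding similarities, which is one reason rank-based fusion is attractive for combining the two retriever families.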