🤖 AI Summary
In distance-based unsupervised multi-label text classification, fixed similarity thresholds fail to accommodate semantic heterogeneity across models, datasets, and labels. This work is the first to systematically characterize the substantial variation in similarity distributions across models, datasets, and individual labels. To address this, we propose a label-specific dynamic threshold optimization method: leveraging state-of-the-art sentence encoders to obtain dense embeddings of texts and labels, we independently learn an optimal similarity threshold for each label on a validation set, then perform classification via a multi-label information retrieval mechanism. Experiments on multiple real-world datasets show that our approach achieves an average 46% improvement over the conventional 0.5 threshold and a 14% gain over uniform threshold strategies. Moreover, it remains robust when labeled examples are scarce. The method significantly enhances both the practicality and generalizability of unsupervised multi-label classification.
📝 Abstract
Distance-based unsupervised text classification leverages the semantic similarity between a label and a text to determine label relevance. This approach offers numerous benefits, including fast inference and adaptability to expanding label sets, unlike zero-shot, few-shot, and fine-tuned neural networks, which require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is "similar" to a label or query. Similarity between a text and a label is computed in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we conduct an exploratory study on a diverse collection of realistic multi-label text classification (MLTC) datasets to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets. We find that similarity distributions show statistically significant differences across models, datasets, and even label sets. We then propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
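The label-specific thresholding idea described above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's implementation: it assumes a precomputed text-label cosine-similarity matrix and uses a simple grid search maximizing per-label F1 on a validation set (the paper's exact objective and search procedure may differ).

```python
import numpy as np

def fit_label_thresholds(sims, y_true, grid=None):
    """Learn one similarity threshold per label by maximizing F1 on a validation set.

    sims   : (n_samples, n_labels) cosine similarities between texts and labels
    y_true : (n_samples, n_labels) binary ground-truth label matrix
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)  # candidate thresholds, step 0.01
    n_labels = sims.shape[1]
    thresholds = np.full(n_labels, 0.5)  # fall back to the conventional 0.5
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = sims[:, j] >= t
            tp = np.sum(pred & (y_true[:, j] == 1))
            fp = np.sum(pred & (y_true[:, j] == 0))
            fn = np.sum(~pred & (y_true[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, thresholds[j] = f1, t
    return thresholds

def predict(sims, thresholds):
    """Assign every label whose similarity clears its learned threshold."""
    return (sims >= thresholds).astype(int)
```

Because each label's threshold is fit independently on a single similarity column, the search is cheap even for large label sets, and new labels can be thresholded without refitting the others.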