One Size Does Not Fit All: Exploring Variable Thresholds for Distance-Based Multi-Label Text Classification

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
In distance-based unsupervised multi-label text classification, fixed similarity thresholds fail to accommodate semantic heterogeneity across models, datasets, and labels. This work systematically characterizes the substantial variation in similarity distributions across models, datasets, and individual labels. To address it, we propose a label-specific dynamic threshold optimization method: state-of-the-art sentence encoders produce dense embeddings of texts and labels, an optimal similarity threshold is learned independently for each label on a validation set, and classification is then performed via a multi-label information retrieval mechanism. Experiments on multiple real-world datasets show that the approach achieves an average 46% improvement over conventional 0.5 thresholding and a 14% gain over uniform thresholding strategies, and it remains robust under label-scarce conditions. The method enhances both the practicality and generalizability of unsupervised multi-label classification.
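
The core of the method is straightforward to sketch. Below is a minimal illustration, not the paper's released code: the encoder name (all-MiniLM-L6-v2), the use of cosine similarity on normalized embeddings, the threshold grid, and the per-label F1 objective are all assumptions for demonstration; the paper's actual encoder, similarity measure, and optimization criterion may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import f1_score

# Illustrative encoder choice; the paper evaluates multiple sentence encoders.
model = SentenceTransformer("all-MiniLM-L6-v2")

def fit_label_thresholds(val_texts, label_names, y_val,
                         grid=np.linspace(0.0, 1.0, 101)):
    """Learn one similarity threshold per label on a validation set.

    y_val: (n_texts, n_labels) binary matrix of true label assignments.
    """
    text_emb = model.encode(val_texts, normalize_embeddings=True)
    label_emb = model.encode(label_names, normalize_embeddings=True)
    sims = text_emb @ label_emb.T  # cosine similarities, (n_texts, n_labels)

    thresholds = np.zeros(len(label_names))
    for j in range(len(label_names)):
        # Grid search: pick the threshold maximizing F1 for this label alone.
        scores = [f1_score(y_val[:, j], sims[:, j] >= t, zero_division=0)
                  for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds
```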

📝 Abstract
Distance-based unsupervised text classification leverages the semantic similarity between a label and a text to determine label relevance. This approach provides numerous benefits, including fast inference and adaptability to expanding label sets, in contrast to zero-shot, few-shot, and fine-tuned neural networks, which require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is "similar" to a label or query. Similarity between a text and a label is measured in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we conduct an exploratory study on a diverse collection of realistic multi-label text classification (MLTC) datasets to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets. We find that similarity distributions show statistically significant differences across models, datasets, and even label sets. We then propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
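
To make the exploratory finding concrete, one way to test whether per-label similarity distributions actually differ is a pairwise distributional test over the similarity matrix. The sketch below uses a two-sample Kolmogorov-Smirnov test and a 0.01 significance level; both are illustrative assumptions, as the abstract does not name the specific statistical test used.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_label_distributions(sims, alpha=0.01):
    """Fraction of label pairs whose similarity distributions differ.

    sims: (n_texts, n_labels) cosine-similarity matrix, e.g. from the
    earlier sketch. Each column is one label's similarity distribution.
    """
    n_labels = sims.shape[1]
    differing, total = 0, 0
    for i in range(n_labels):
        for j in range(i + 1, n_labels):
            stat, p = ks_2samp(sims[:, i], sims[:, j])
            differing += p < alpha  # reject "same distribution" at level alpha
            total += 1
    return differing / total
```

A fraction near 1.0 would indicate that almost every label has its own similarity distribution, which is the heterogeneity that motivates per-label thresholds over a single uniform cutoff.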
Problem

Research questions and friction points this paper is trying to address.

How do similarity distributions between texts and labels vary across models, datasets, and label sets?
How can label-specific thresholds be optimized to improve distance-based classification performance?
Fixed and uniform thresholds fail to account for threshold variability across models and datasets in multi-label classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes label-specific thresholds for multi-label classification (see the inference sketch after this list)
Uses a validation set to optimize per-label threshold parameters
Leverages semantic similarity between text and label embeddings in a dense embedding space
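
Once the per-label thresholds are fitted, inference reduces to one broadcast comparison per label. This continues the earlier sketch; the encoder name and function signature are again assumptions, not the paper's published interface.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Same illustrative encoder as in the threshold-fitting sketch above.
model = SentenceTransformer("all-MiniLM-L6-v2")

def predict(texts, label_names, thresholds):
    """Assign every label whose cosine similarity clears that label's threshold."""
    text_emb = model.encode(texts, normalize_embeddings=True)
    label_emb = model.encode(label_names, normalize_embeddings=True)
    sims = text_emb @ label_emb.T   # (n_texts, n_labels)
    return sims >= thresholds       # broadcast: per-label comparison, boolean matrix
```

Because labels are scored independently, new labels can be added by encoding their names and fitting one more threshold, without re-training anything; this is the fast-inference, expandable-label-set benefit the abstract highlights.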