AI Summary
Distantly supervised named entity recognition (DS-NER) suffers from high false-negative rates due to the incompleteness of the underlying knowledge bases. To address this, we propose the Constrained Multi-class Positive-Unlabeled (CMPU) learning framework, a novel approach that, for the first time, incorporates a non-negativity constraint into the risk estimator of multi-class PU learning, thereby relaxing the implicit assumption of positive-example completeness inherent in conventional PU methods. Theoretically, this constraint improves model robustness and mitigates overfitting, and it is integrated with explicit modeling of distant-supervision noise and risk-minimization optimization. Evaluated on two benchmark datasets annotated via multiple heterogeneous knowledge bases, CMPU consistently outperforms state-of-the-art DS-NER methods, achieving absolute F1-score improvements of 3.2–5.8 percentage points. These results empirically validate both the effectiveness and the generalizability of our constrained risk-estimation strategy.
Abstract
Distantly supervised named entity recognition (DS-NER) has been proposed to exploit training data automatically labeled by external knowledge bases instead of human annotations. However, it tends to suffer from a high false-negative rate due to the inherent incompleteness of those knowledge bases. To address this issue, we present a novel approach called Constraint Multi-class Positive and Unlabeled Learning (CMPU), which introduces a constraint factor on the risk estimator over the multiple positive classes. We show that the constrained non-negative risk estimator is more robust against overfitting than previous PU learning methods when positive data are limited. A solid theoretical analysis of CMPU is provided to prove the validity of our approach. Extensive experiments on two benchmark datasets labeled using diverse external knowledge sources demonstrate the superior performance of CMPU in comparison to existing DS-NER methods.
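To make the core idea concrete, the following is a minimal sketch of a non-negative multi-class PU risk estimator in the spirit described above. It is not the authors' implementation: the cross-entropy loss, the class-prior values, and the function name `cmpu_style_risk` are illustrative assumptions; only the structure (prior-weighted positive risk plus a negative risk clipped at zero) reflects the constrained estimator the abstract refers to.

```python
import numpy as np

def cmpu_style_risk(probs_pos, labels_pos, probs_unl, priors):
    """Sketch of a constrained (non-negative) multi-class PU risk.

    probs_pos : (n_p, C+1) predicted class probabilities for distantly
                labeled positive tokens; column 0 is the negative class.
    labels_pos: (n_p,) positive class labels in {1, ..., C}.
    probs_unl : (n_u, C+1) predicted probabilities for unlabeled tokens.
    priors    : (C,) hypothetical class priors pi_c (assumed known here).
    """
    eps = 1e-12
    C = len(priors)
    risk_pos = 0.0  # prior-weighted loss on each positive class
    corr = 0.0      # correction term estimated on positive samples
    for c in range(1, C + 1):
        mask = labels_pos == c
        if not mask.any():
            continue
        pi = priors[c - 1]
        risk_pos += pi * (-np.log(probs_pos[mask, c] + eps)).mean()
        corr += pi * (-np.log(probs_pos[mask, 0] + eps)).mean()
    # Loss of treating all unlabeled tokens as the negative class.
    risk_unl = (-np.log(probs_unl[:, 0] + eps)).mean()
    # Non-negativity constraint: the estimated negative risk is clipped
    # at zero, which is what guards against overfitting when the set of
    # labeled positives is small and incomplete.
    risk_neg = max(0.0, risk_unl - corr)
    return risk_pos + risk_neg
```

In an unconstrained PU estimator, `risk_unl - corr` can go negative as the model overfits the scarce positives; the `max(0.0, ...)` clip is the constraint factor that keeps the overall risk bounded below.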