🤖 AI Summary
This study addresses the detrimental impact of noisy labels, in particular sentence-level labels derived from crowdsourced document-level annotations, on language-model performance in non-topical classification tasks such as sentence difficulty prediction. It presents the first systematic evaluation of multiple denoising strategies, including Gaussian Mixture Models (GMM), Co-Teaching, noise transition matrices, and label smoothing, within a multilingual sentence difficulty prediction framework that uses multilingual BERT for cross-lingual training. The work reveals a complementary relationship between the inherent noise robustness of pretrained models and explicit denoising techniques, and introduces the largest multilingual sentence difficulty corpus to date. Experiments show that on the smaller dataset, GMM-based denoising raises AUC from 0.52 to 0.92, and combining methods reaches 0.93; on the larger dataset the gains are modest (0.92 to 0.94), but roughly 20% of noisy samples are filtered out, substantially improving corpus quality.
📝 Abstract
Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study, we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, noise transition matrices, and label smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective, raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. For our larger dataset, however, the intrinsic regularisation of pre-trained language models provides a strong baseline, and denoising yields only marginal gains (from 0.92 to 0.94, with a combination of two denoising methods contributing nothing further). Nonetheless, removing noisy sentences (about 20% of the dataset) produces a cleaner corpus with fewer infelicities. As a result, we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
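GMM-based noise filtering of the kind evaluated here is commonly implemented by fitting a two-component Gaussian mixture to per-sample training losses: confidently learned (likely clean) samples fall in the low-loss component, while mislabelled ones tend to incur high loss. The sketch below, using scikit-learn on synthetic loss values, illustrates the idea; the loss distribution, the 0.5 posterior threshold, and all numbers are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic per-sample cross-entropy losses: clean samples cluster at low
# loss, noisy-label samples at high loss (values are purely illustrative).
rng = np.random.default_rng(0)
losses = np.concatenate([
    rng.normal(0.3, 0.1, 800),   # mostly clean labels
    rng.normal(2.0, 0.4, 200),   # mislabelled samples
]).reshape(-1, 1)

# Fit a two-component GMM to the one-dimensional loss distribution.
gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)

# The component with the lower mean loss is taken to be the "clean" one.
clean_component = int(np.argmin(gmm.means_.ravel()))
p_clean = gmm.predict_proba(losses)[:, clean_component]

# Keep samples whose posterior probability of being clean exceeds 0.5;
# the rest are treated as label noise and dropped from the training set.
keep_mask = p_clean > 0.5
print(f"Kept {keep_mask.sum()} of {len(losses)} samples")
```

In practice the losses would come from a partially trained difficulty classifier rather than a synthetic distribution, and the filtered subset would then be used to retrain the model on a cleaner corpus.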