🤖 AI Summary
Label noise in human-annotated data severely degrades model performance in industrial deep learning. To address this, the paper proposes a lightweight, model-agnostic, closed-loop human-in-the-loop data cleaning method. The method identifies suspicious labels by jointly using model prediction confidence and cross-model or cross-iteration prediction consistency, then triggers human review and relabeling without modifying the model architecture or training pipeline. It is compatible with diverse tasks, including classification, sequence labeling, object detection, text generation, and CTR prediction, as well as with mainstream annotation platforms. On development sets, it consistently achieves scores above 90. Human evaluation confirms high accuracy in noise identification, and multi-task experiments demonstrate its effectiveness, generalizability, and scalability across domains and model types.
📝 Abstract
In industrial deep learning applications, manually labeled data contains a certain amount of noise. To address this problem and achieve a score above 90 on the dev dataset, we present a simple method that finds the noisy examples and has humans re-label them, using the model predictions as references during labeling. In this paper, we illustrate the idea on a broad set of deep learning tasks, including classification, sequence tagging, object detection, sequence generation, and click-through rate prediction. The dev dataset evaluation results and human evaluation results verify our idea.
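The abstract does not spell out how noisy examples are selected, so the following is only a minimal sketch of one plausible reading for the classification case: an example is queued for human review when the model's predicted class disagrees with the human label and the model is confident. The names `ReviewItem`, `find_suspect_labels`, and the `min_confidence` threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReviewItem:
    """One example queued for human re-labeling (hypothetical structure)."""
    index: int          # position of the example in the dataset
    human_label: int    # the original, possibly noisy, label
    model_label: int    # the model prediction shown to the annotator as a reference
    confidence: float   # model probability for its predicted class


def find_suspect_labels(
    labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    min_confidence: float = 0.9,  # assumed threshold, not a value from the paper
) -> List[ReviewItem]:
    """Flag examples where a confident model prediction contradicts the human label."""
    suspects = []
    for i, (label, p) in enumerate(zip(labels, probs)):
        # Index of the highest-probability class.
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != label and p[predicted] >= min_confidence:
            suspects.append(ReviewItem(i, label, predicted, p[predicted]))
    # Review the most confident disagreements first.
    suspects.sort(key=lambda item: item.confidence, reverse=True)
    return suspects


if __name__ == "__main__":
    labels = [0, 1, 1, 0]
    probs = [[0.95, 0.05], [0.97, 0.03], [0.40, 0.60], [0.08, 0.92]]
    for item in find_suspect_labels(labels, probs):
        print(item)
```

In this sketch the annotator would see both `human_label` and `model_label` for each queued item and decide which, if either, is correct. The other tasks named in the abstract (sequence tagging, object detection, sequence generation, click-through rate prediction) would need a task-specific notion of disagreement in place of the simple class comparison.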