🤖 AI Summary
To address the high cost and inherent human error in acquiring high-quality labeled data, this paper proposes the "Probably Approximately Correct Labeling" (PAC-Label) theoretical framework—the first to provide provable upper bounds on labeling error for AI-assisted annotation, thereby bridging the gap between statistical learning theory and practical data curation. Methodologically, it integrates large language models (for text), pretrained vision models (for images), and AlphaFold (for protein structures), combining multimodal predictions with statistical calibration to generate label sets that are approximately correct with quantifiable probability guarantees. Empirical validation across text classification, image recognition, and protein folding analysis demonstrates a substantial reduction in annotation cost while ensuring, with high probability, that the overall labeling error remains below a user-specified threshold. The core contributions are: (1) a theoretically grounded, verifiable error-control mechanism for labeling, and (2) a novel cross-modal calibration paradigm.
📝 Abstract
Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach yields probably approximately correct labels: with high probability, the overall labeling error is small. This enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
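The flavor of such a guarantee can be illustrated with a minimal sketch (my own illustration, not the paper's actual algorithm): hold out a small expert-labeled calibration set, then pick a model-confidence threshold above which AI predictions are accepted as labels. A one-sided Hoeffding bound on the calibration error, with a union bound over the candidate thresholds, ensures that with probability at least 1 − δ the error among accepted labels stays below a user-chosen ε; everything below the threshold is routed to experts. All names and the synthetic data here are hypothetical.

```python
import math
import random

random.seed(0)

def hoeffding_upper_bound(emp_err: float, n: int, delta: float) -> float:
    """One-sided Hoeffding bound: with probability >= 1 - delta, the true
    error rate is at most the empirical rate plus sqrt(ln(1/delta) / (2n))."""
    return emp_err + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def choose_threshold(cal_conf, cal_correct, eps, delta, grid):
    """Return the lowest confidence threshold t in `grid` such that the
    Hoeffding upper bound on the error of predictions with confidence >= t
    is at most eps; None if no threshold qualifies. The union bound over
    the grid (delta / len(grid)) accounts for testing many thresholds."""
    delta_per = delta / len(grid)
    for t in sorted(grid):
        kept = [ok for c, ok in zip(cal_conf, cal_correct) if c >= t]
        if not kept:
            continue
        emp_err = 1.0 - sum(kept) / len(kept)
        if hoeffding_upper_bound(emp_err, len(kept), delta_per) <= eps:
            return t
    return None

# Synthetic calibration set: a prediction with confidence c is correct
# with probability c (i.e., the model is perfectly calibrated).
cal_conf = [random.random() for _ in range(20_000)]
cal_correct = [random.random() < c for c in cal_conf]

grid = [i / 20 for i in range(20)]
t = choose_threshold(cal_conf, cal_correct, eps=0.10, delta=0.05, grid=grid)
print("accept AI labels with confidence >=", t)
```

Predictions above the returned threshold become "probably approximately correct" labels; the rest fall back to human annotation, which is where the cost savings come from. The paper's actual method is more general (and handles the multimodal settings above), but the trade-off is the same: a tighter ε or smaller δ keeps fewer AI labels and sends more items to experts.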