Split Conformal Prediction under Data Contamination

📅 2024-07-10
🏛️ International Symposium on Conformal and Probabilistic Prediction with Applications
📈 Citations: 7
Influential: 1
🤖 AI Summary
This paper investigates the robustness of split conformal prediction under data contamination: when a small fraction of calibration samples (e.g., due to label noise or distributional shift) originates from a contaminated distribution, standard methods suffer significant degradation in coverage on clean test points. To address this, the paper proposes Contamination Robust Conformal Prediction (CRCP), a framework featuring quantile-robust calibration, empirical contamination distribution modeling, and theoretical analysis of coverage error bounds. The analysis establishes that the coverage loss is bounded above by a linear function of the contamination proportion. Empirically, CRCP maintains over 90% nominal coverage even under 10% contamination, substantially outperforming standard conformal prediction, while preserving predictive efficiency and practical applicability.

📝 Abstract
Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data is exchangeable. It is popular as it comes with theoretical guarantees on the marginal coverage of the prediction sets, and the split conformal prediction variant has a very low computational cost compared to model training. We study the robustness of split conformal prediction in a data contamination setting, where we assume a small fraction of the calibration scores are drawn from a different distribution than the bulk. We quantify the impact of the corrupted data on the coverage and efficiency of the constructed sets when evaluated on "clean" test points, and verify our results with numerical experiments. Moreover, we propose an adjustment in the classification setting which we call Contamination Robust Conformal Prediction, and verify the efficacy of our approach using both synthetic and real datasets.
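The setting in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's experimental setup: the function names (`conformal_quantile`, `simulate_clean_coverage`), the clean score distribution (absolute value of a standard normal), and the contamination model (a fixed upward `shift` applied to a fraction `eps` of calibration scores) are all assumptions chosen for illustration. It computes the usual split conformal threshold from possibly contaminated calibration scores and then measures coverage on clean test scores.

```python
import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Split conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, or +inf if that rank exceeds n."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(cal_scores)[k - 1]


def simulate_clean_coverage(n_cal=1000, n_test=20000, alpha=0.1,
                            eps=0.1, shift=3.0, seed=0):
    """Calibrate on a contaminated score sample, evaluate on clean scores.

    Assumed toy model (not from the paper): clean scores are |N(0, 1)|;
    each calibration score is contaminated with probability eps by adding
    `shift` to it.
    """
    rng = np.random.default_rng(seed)
    is_contaminated = rng.random(n_cal) < eps
    cal_scores = np.abs(rng.standard_normal(n_cal)) + shift * is_contaminated
    threshold = conformal_quantile(cal_scores, alpha)
    # Clean test points: a test score is "covered" when it is <= threshold.
    test_scores = np.abs(rng.standard_normal(n_test))
    coverage = float(np.mean(test_scores <= threshold))
    return threshold, coverage


if __name__ == "__main__":
    for eps in (0.0, 0.05, 0.1):
        threshold, coverage = simulate_clean_coverage(eps=eps)
        print(f"eps={eps:.2f}  threshold={threshold:.3f}  clean coverage={coverage:.3f}")
```

In this toy model the contaminated scores are larger than the clean ones, so the threshold moves up and the sets over-cover clean test points (at the cost of efficiency). Making `shift` negative would push contamination toward small scores and illustrate the opposite effect, under-coverage.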
Problem

Research questions and friction points this paper is trying to address.

Studies robustness of split conformal prediction with data contamination
Quantifies impact of corrupted calibration data on coverage and efficiency
Proposes adjustment method for contamination in classification setting
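
A population-level calculation gives some intuition for how contaminated calibration scores shift coverage on clean test points. This is a sketch under assumptions that are not taken from the paper: calibration scores follow a mixture with clean score CDF $F$, contaminating CDF $G$, and contamination fraction $\varepsilon$ (symbols introduced here for illustration), the mixture CDF is continuous, and finite-sample effects are ignored. It is not the paper's theorem.

```latex
\begin{align*}
\tilde F(s) &= (1-\varepsilon)\,F(s) + \varepsilon\,G(s),
  \qquad \tilde F(\hat q) = 1-\alpha
  && \text{(threshold } \hat q \text{ calibrated on the mixture)} \\
F(\hat q) &= \frac{1-\alpha-\varepsilon\,G(\hat q)}{1-\varepsilon}
  && \text{(coverage on clean test points)} \\
1-\frac{\alpha}{1-\varepsilon} &\le F(\hat q) \le
  \min\!\left\{1,\ \frac{1-\alpha}{1-\varepsilon}\right\}
  && \text{(since } 0 \le G(\hat q) \le 1\text{)}
\end{align*}
```

For example, with $\alpha = 0.1$ and $\varepsilon = 0.1$, clean coverage lies between $0.8/0.9 \approx 0.889$ (all contaminated scores fall below the threshold) and $1$ (all contaminated scores fall above it).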
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robustness analysis of split conformal prediction under data contamination
Adjustment for the classification setting, called Contamination Robust Conformal Prediction
Quantification of the impact on coverage and efficiency, verified with numerical experiments