🤖 AI Summary
Existing data-corruption studies are fragmented across specific scenarios, lacking a unified theoretical framework and systematic mitigation strategies. Method: We propose the first general corruption-modeling framework based on Markov kernels, formalizing corruption as an arbitrary modification of the data distribution, the hypothesis class, or the loss function. We establish a provably complete taxonomy, distinguishing for the first time label corruption (which affects only the loss) from attribute and joint corruption (which affect both the hypothesis class and the loss). Building on this, we introduce a generalized loss-correction paradigm and derive provably effective correction formulas for attribute and joint corruption under weaker assumptions than conventional approaches. Contribution/Results: Our framework unifies disparate corruption models and terminologies, providing a rigorous foundation for robustness analysis and algorithm design in supervised learning. It enables principled treatment of previously isolated corruption types and advances the theoretical understanding of learning under distributional and structural perturbations.
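As a rough illustration of the kernel view (the notation here is ours, not necessarily the paper's): if $D$ is the clean data distribution on $Z = X \times Y$ and $k$ is a Markov kernel from $Z$ to a corrupted space $\tilde{Z}$, the corrupted distribution is the pushforward of $D$ through $k$.

```latex
% Sketch: corruption as a Markov kernel acting on the clean distribution D.
% k(A | z) is the probability that a clean point z is corrupted into the set A.
\tilde{D}(A) \;=\; (k \circ D)(A) \;=\; \int_{Z} k(A \mid z)\, \mathrm{d}D(z),
\qquad A \subseteq \tilde{Z} \text{ measurable.}
```

Specializing $k$ to act only on the label component recovers label corruption; letting it act on attributes (or both) yields the attribute and joint cases the taxonomy distinguishes.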
📝 Abstract
Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelling and mitigation. In this work, we develop a general theory of corruption that encompasses all modifications to a supervised learning problem, including changes in the model class and the loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach opens three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework that distinguishes among corruption types, unifying existing models and establishing a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences for learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building on these results, we investigate mitigations for the various corruption types. We extend existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the need to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements, so as to encompass more corruption types. We provide such a paradigm, along with loss-correction formulas for the attribute and joint corruption cases.
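To make the loss-correction idea concrete, here is a minimal sketch of the *classical* backward correction for label noise that the abstract says is being generalized. All numbers and names below are hypothetical; the only assumption is a known label-noise transition matrix `T` (itself a finite Markov kernel), as in standard loss-correction work.

```python
import numpy as np

# Hypothetical noise kernel for binary labels:
# T[i, j] = P(observed label j | true label i).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def backward_corrected_loss(loss_vec, T):
    """Return corrected per-label losses l_corr solving T @ l_corr = loss_vec.

    Averaging l_corr over noisy labels drawn from T recovers the clean loss,
    so the corrected empirical risk is unbiased for the clean risk."""
    return np.linalg.solve(T, loss_vec)

# Hypothetical per-class losses of some predictor at a fixed input.
loss = np.array([0.3, 1.2])
corrected = backward_corrected_loss(loss, T)

# Unbiasedness check: taking the expectation of the corrected loss under the
# noise kernel gives back the clean loss.
print(np.allclose(T @ corrected, loss))  # True
```

The paper's contribution, per the abstract, is to extend this kind of correction beyond the label-only setting (where `T` suffices) to attribute and joint corruption, under weaker requirements than the classical invertibility-style assumptions.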