Intra-Cluster Mixup: An Effective Data Augmentation Technique for Complementary-Label Learning

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In complementary label learning (CLL), existing Mixup-based data augmentation methods degrade performance due to complementary label noise introduced by interpolating between heterogeneous samples. This work first identifies and analyzes the underlying failure mechanism. To address it, we propose Intra-Cluster Mixup—a novel, structure-aware augmentation strategy that performs interpolation exclusively within semantically coherent clusters identified via unsupervised clustering. Integrated with contrastive learning and nearest-neighbor sample selection, our method enforces local geometric consistency while suppressing label noise. By avoiding cross-cluster mixing, it preserves the discriminative information inherent in complementary labels. Extensive experiments on MNIST and CIFAR demonstrate absolute accuracy improvements of 30% and 10%, respectively, under both balanced and imbalanced complementary label settings. To our knowledge, this is the first structured, cluster-guided data augmentation framework specifically designed for CLL, establishing a reliable and robust paradigm for noise-resilient representation learning in this setting.

📝 Abstract
In this paper, we investigate the challenges of complementary-label learning (CLL), a specialized form of weakly-supervised learning (WSL) where models are trained with labels indicating classes to which instances do not belong, rather than ordinary labels indicating the true class. This alternative supervision is appealing because collecting complementary labels is generally cheaper and less labor-intensive. Although most existing research in CLL emphasizes the development of novel loss functions, the potential of data augmentation in this domain remains largely underexplored. In this work, we uncover that the widely-used Mixup data augmentation technique is ineffective when directly applied to CLL. Through in-depth analysis, we identify that the complementary-label noise generated by Mixup negatively impacts the performance of CLL models. We then propose an improved technique called Intra-Cluster Mixup (ICM), which only synthesizes augmented data from nearby examples, to mitigate the noise effect. ICM carries the benefit of encouraging complementary-label sharing among nearby examples, and leads to substantial performance improvements across synthetic and real-world labeled datasets. In particular, our wide spectrum of experimental results on both balanced and imbalanced CLL settings demonstrates the potential of ICM when allied with state-of-the-art CLL algorithms, achieving significant accuracy increases of 30% and 10% on MNIST and CIFAR datasets, respectively.
Problem

Research questions and friction points this paper is trying to address.

Improving complementary-label learning where models use negative class labels
Addressing Mixup augmentation's ineffectiveness due to complementary-label noise
Proposing Intra-Cluster Mixup to enhance performance in weakly-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intra-Cluster Mixup synthesizes data from nearby examples
Mitigates complementary-label noise generated by standard Mixup
Encourages complementary label sharing among similar instances
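The core idea above can be sketched in a few lines: cluster the data without labels, then restrict Mixup partners to the same cluster so that mixed pairs are likely to share complementary labels. This is a minimal illustration, not the paper's exact procedure; the use of KMeans, the Beta(alpha, alpha) mixing coefficient, and all function and parameter names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def intra_cluster_mixup(X, comp_labels, n_clusters=3, alpha=1.0, seed=0):
    """Mix each sample only with a partner drawn from its own cluster.

    X: (n, d) feature array; comp_labels: (n,) complementary labels.
    Returns mixed features, the (label_i, label_j) pairs, and the mixing
    coefficients, so a CLL loss can weight the two labels by lam / (1 - lam).
    """
    rng = np.random.default_rng(seed)
    # Unsupervised clustering stands in for the "semantically coherent
    # clusters" of the method; any clustering of learned features would do.
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)

    mixed_X, label_pairs, lams = [], [], []
    for i in range(len(X)):
        # Candidate partners: members of the same cluster only
        # (cross-cluster mixing is what introduces label noise).
        same = np.flatnonzero(clusters == clusters[i])
        j = rng.choice(same)
        lam = rng.beta(alpha, alpha)  # standard Mixup coefficient
        mixed_X.append(lam * X[i] + (1 - lam) * X[j])
        label_pairs.append((comp_labels[i], comp_labels[j]))
        lams.append(lam)
    return np.array(mixed_X), label_pairs, np.array(lams)
```

When the clusters align with the true classes, both partners tend to carry complementary labels that exclude the same classes, which is the label-sharing effect the bullets describe.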
Tan-Ha Mai
Department of Computer Science and Information Engineering, National Taiwan University
Hsuan-Tien Lin
Professor of Computer Science and Information Engineering, National Taiwan University
Machine Learning · Data Mining