Complement Submodular Information Measures for Balanced and Robust Data Selection

📅 2026-05-23
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work addresses a limitation of classical submodular optimization: the objective scores only the selected subset and ignores the structural balance between that subset and its complement, which can compromise the representativeness and robustness of data partitions. To overcome this, the paper proposes the Complement Submodular Information (CSI) framework, a new class of complement-aware submodular objectives that quantify the structural information shared between a subset and its complement. This yields a selection criterion that jointly accounts for representativeness and balance. The paper instantiates CSI variants of classical functions, including Facility Location, Graph Cut, and LogDet, and shows that they are approximately monotone under bounded curvature conditions, which leads to a near-(1−1/e) approximation guarantee for greedy maximization. Experiments show that CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection, preserving coherent rare/tail semantic structure, suppressing noisy and isolated outliers, and improving downstream predictive performance.
📝 Abstract
Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.
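For orientation, the classical baseline that the abstract contrasts against can be sketched in a few lines. The example below is a minimal illustration of standard greedy maximization of the ordinary Facility Location function, f(A) = Σ_i max_{j∈A} sim(i, j), which scores only the selected subset. It is not the CSI objective, whose definition the abstract does not give. The function names, the Gaussian similarity, and the toy points are all assumptions made for illustration, and similarities are assumed non-negative.

```python
import math


def facility_location(sim, subset):
    """Classical Facility Location: f(A) = sum_i max_{j in A} sim[i][j], f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Standard greedy: repeatedly add the element with the largest marginal gain."""
    n = len(sim)
    selected = []
    cover = [0.0] * n  # cover[i] = best similarity of point i to the selected set
    for _ in range(min(k, n)):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j, assuming non-negative similarities.
            gain = sum(max(sim[i][j] - cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        cover = [max(cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy data: two tight clusters in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
# Gaussian similarity (an illustrative choice, not taken from the paper).
sim = [
    [math.exp(-((ax - bx) ** 2 + (ay - by) ** 2)) for (bx, by) in points]
    for (ax, ay) in points
]
selected = greedy_select(sim, k=2)
print(sorted(selected))
```

With a budget of two, the greedy pass picks one point from each cluster, since a second point from an already covered cluster adds almost no marginal gain. Note that this objective never looks at what remains in the complement, which is the gap the abstract says CSI is designed to close.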
Problem

Research questions and friction points this paper is trying to address.

submodular optimization
data selection
complement structure
balanced representation
robust subset selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complement Submodular Information
submodular optimization
balanced data selection
robust subset selection
structure preservation