Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures

📅 2025-09-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address efficiency and scalability bottlenecks in pairwise-constraint active clustering on large-scale data—namely, high computational cost, strong dependence on human annotations, and excessive memory consumption—this paper proposes a dual-sparse graph co-evolution framework grounded in a “data skeleton” abstraction. It constructs lightweight sparse similarity and constraint graphs, enabling nested iterative subgraph optimization, and incorporates a multi-distance metric compatibility mechanism to enhance robustness. The approach significantly reduces both computational complexity and memory footprint per clustering update, while amplifying the information gain of individual user-provided constraints. Experiments demonstrate that the method achieves higher clustering accuracy (+2.1–4.7%) using 35% fewer pairwise constraints on average, accelerates computation by 1.8–3.2×, and reduces memory usage by approximately 40%, exhibiting superior scalability across large datasets.

Technology Category

Application Category

📝 Abstract
In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs for iterative clustering updates
Enhancing impact of user constraints to minimize annotation needs
Cutting down memory usage in practical deployment scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse graph structures for active clustering
Data skeleton representation for scalable processing
Refining connected subgraphs to create nested clusters
🔎 Similar Papers
No similar papers found.
Wen-Bo Xie
Wen-Bo Xie
Southwest Petroleum University
Machine LearningData MiningGraph Mining
Xun Fu
Xun Fu
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, People’s Republic of China
B
Bin Chen
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, People’s Republic of China
Yan-Li Lee
Yan-Li Lee
Xihua University
Graph MiningNLPComputational Socioeconomics
T
Tao Deng
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, People’s Republic of China
T
Tian Zou
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, People’s Republic of China
X
Xin Wang
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, People’s Republic of China
Z
Zhen Liu
Web Sciences Center, University of Electronic Science and Technology of China, Chengdu 611731, People’s Republic of China
Jaideep Srivastava
Jaideep Srivastava
Professor, University of Minnesota
Health AnalyticsSocial ComputingSocial Network AnalysisWeb MiningDigital Health