Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures

📅 2025-09-10

🤖 AI Summary

Pairwise-constraint active clustering on large-scale data suffers from three efficiency and scalability bottlenecks: high computational cost, strong dependence on human annotations, and excessive memory consumption. To address them, this paper proposes a dual sparse-graph co-evolution framework built on a “data skeleton” abstraction. It constructs lightweight sparse similarity and constraint graphs that support nested, iterative subgraph optimization, and it stays compatible with multiple distance metrics for robustness. The approach reduces the computational complexity and memory footprint of each clustering update while increasing the information gained from each user-provided constraint. Reported experiments show higher clustering accuracy (+2.1–4.7%) with 35% fewer pairwise constraints on average, a 1.8–3.2× speedup, and roughly 40% lower memory usage, with better scalability on large datasets.

📝 Abstract

In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, which is crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce the computational cost of iterative clustering updates; (2) to increase the impact of each user-provided constraint so that precise clustering needs fewer annotations; and (3) to cut memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that uses two sparse graphs: one representing relationships between data points (our proposed data skeleton) and another for updating this data skeleton. The two graphs work in concert, refining connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently produces more accurate clusterings with dramatically fewer user-provided constraints, and that it outperforms its counterparts in computational performance and scalability while remaining robust across various distance metrics.
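The idea of reading clusters off the connected subgraphs of a sparse graph can be illustrated with a small sketch. This is not the paper's algorithm: the k-nearest-neighbour construction, the rule of cutting the longest edge on a path to satisfy a cannot-link constraint, and all function names (`knn_skeleton`, `apply_constraint`, `clusters`) are assumptions made here for illustration, and the sketch folds constraints directly into one graph rather than keeping a second graph for updates.

```python
from collections import deque
from math import dist


def knn_skeleton(points, k):
    """Hypothetical skeleton: each point keeps edges to its k nearest neighbours."""
    edges = {}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in sorted(others, key=lambda j: dist(p, points[j]))[:k]:
            edges[frozenset((i, j))] = dist(p, points[j])
    return edges


def adjacency(edges, n):
    """Adjacency sets built from the undirected edge dictionary."""
    adj = {i: set() for i in range(n)}
    for edge in edges:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    return adj


def find_path(adj, src, dst):
    """Breadth-first path from src to dst, or None if they are disconnected."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def apply_constraint(edges, n, a, b, must_link):
    """Assumed update rule: must-link adds an edge, cannot-link cuts paths."""
    if a == b:
        return
    if must_link:
        edges.setdefault(frozenset((a, b)), 0.0)
        return
    while True:
        path = find_path(adjacency(edges, n), a, b)
        if path is None:
            return
        # Cut the longest (weakest) edge on the current path between a and b.
        longest = max(zip(path, path[1:]), key=lambda e: edges[frozenset(e)])
        del edges[frozenset(longest)]


def clusters(edges, n):
    """Label each point with the index of its connected component."""
    adj = adjacency(edges, n)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels
```

In this sketch a single constraint can change the labels of many points at once, because it edits the graph rather than one assignment. Memory stays proportional to the number of edges (at most `k` per point) instead of the full pairwise distance matrix, although the brute-force neighbour search used here would need replacing for large inputs.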

Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs for iterative clustering updates

Enhancing impact of user constraints to minimize annotation needs

Cutting down memory usage in practical deployment scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse graph structures for active clustering

Data skeleton representation for scalable processing

Refining connected subgraphs to create nested clusters
