🤖 AI Summary
To address the challenge of simultaneously preserving structural relationships, ensuring scalability, and maintaining decision boundaries in instance selection (IS) for large-scale high-dimensional data, this paper proposes a graph attention–driven IS framework. Our key contributions are: (1) the first integration of graph attention mechanisms into IS to explicitly model higher-order structural dependencies among instances; (2) a hierarchical hashing scheme—supporting single-level, multi-level, and multi-view variants—for multi-granularity similarity modeling; and (3) a distance-aware hierarchical mini-batch sampling strategy that ensures class balance while optimizing computational efficiency. Evaluated on 39 benchmark datasets, our method achieves over 96% data compression with classification accuracy matching or surpassing state-of-the-art IS approaches. The multi-view variant notably enhances performance on high-dimensional, complex data, while the mini-batch strategy attains an optimal trade-off between efficiency and accuracy.
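The class-balance component of the hierarchical mini-batch sampling strategy can be sketched with plain stratified sampling. This is a minimal illustration, not the paper's implementation: the function name `stratified_minibatches` and its parameters are assumptions, and the distance-aware part of the strategy is omitted.

```python
import numpy as np

def stratified_minibatches(y, batch_size, seed=0):
    """Yield index batches whose class proportions mirror the full dataset.

    Illustrative sketch only: real GAIS sampling is also distance-aware.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    # Shuffle indices within each class once, up front
    per_class = {c: rng.permutation(np.flatnonzero(y == c)) for c in classes}
    fracs = counts / counts.sum()
    n_batches = int(np.ceil(len(y) / batch_size))
    for b in range(n_batches):
        parts = []
        for c, f in zip(classes, fracs):
            k = int(round(f * batch_size))  # per-class quota for this batch
            parts.append(per_class[c][b * k:(b + 1) * k])
        yield np.concatenate(parts)
```

Each yielded batch then seeds a local graph, so attention scores are computed within batches rather than over all pairwise distances.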
📝 Abstract
Instance selection (IS) is important in machine learning for reducing dataset size while preserving key characteristics. Current IS methods often struggle to capture complex relationships in high-dimensional spaces and to scale to large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that enables efficient similarity computation through random projections. The mini-batch approach preserves class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings show that the distance-based mini-batch approach offers an optimal balance of efficiency and effectiveness for large-scale datasets, while multi-view variants provide superior performance on complex, high-dimensional data. This demonstrates that attention-based importance scoring can identify instances crucial for maintaining decision boundaries without requiring exhaustive pairwise comparisons.
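The single-level case of hashing via random projections can be sketched as sign-based bucketing: instances are projected onto random hyperplanes, and the resulting bit pattern groups similar instances into shared buckets without pairwise comparisons. The function below is a hypothetical minimal sketch, not the paper's hierarchical scheme; `n_bits` and the bucket encoding are illustrative assumptions.

```python
import numpy as np

def random_projection_hash(X, n_bits=8, seed=0):
    """Map each row of X to a bucket id via sign-based random projections.

    Nearby instances tend to land in the same bucket, so similarity
    need only be computed within buckets rather than over all pairs.
    """
    rng = np.random.default_rng(seed)
    # One random hyperplane per hash bit
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0
    # Pack each row's sign pattern into a single integer bucket id
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))
```

A multi-level variant would rehash large buckets with fresh hyperplanes, and a multi-view variant would hash disjoint feature subsets separately and combine the bucket assignments.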