🤖 AI Summary
To address the challenge of simultaneously preserving structural relationships, ensuring scalability, and maintaining decision boundaries in instance selection (IS) for large-scale high-dimensional data, this paper proposes a graph attention–driven IS framework. Our key contributions are: (1) the first integration of graph attention mechanisms into IS to explicitly model higher-order structural dependencies among instances; (2) a hierarchical hashing scheme—supporting single-level, multi-level, and multi-view variants—for multi-granularity similarity modeling; and (3) a distance-aware hierarchical mini-batch sampling strategy that ensures class balance while optimizing computational efficiency. Evaluated on 39 benchmark datasets, our method achieves over 96% data compression with classification accuracy matching or surpassing state-of-the-art IS approaches. The multi-view variant notably enhances performance on high-dimensional, complex data, while the mini-batch strategy attains an optimal trade-off between efficiency and accuracy.
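The class-balance component of the hierarchical mini-batch sampling strategy can be sketched with plain stratified sampling. This is a minimal illustration, not the paper's implementation: the function name `stratified_minibatches` and its parameters are assumptions, and the distance-aware part of the strategy is omitted.

```python
import numpy as np

def stratified_minibatches(y, batch_size, seed=0):
    """Yield index batches whose class proportions mirror the full dataset.

    Illustrative sketch only: real GAIS sampling is also distance-aware.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    # Shuffle indices within each class once, up front
    per_class = {c: rng.permutation(np.flatnonzero(y == c)) for c in classes}
    fracs = counts / counts.sum()
    n_batches = int(np.ceil(len(y) / batch_size))
    for b in range(n_batches):
        parts = []
        for c, f in zip(classes, fracs):
            k = int(round(f * batch_size))  # per-class quota for this batch
            parts.append(per_class[c][b * k:(b + 1) * k])
        yield np.concatenate(parts)
```

Each yielded batch then seeds a local graph, so attention scores are computed within batches rather than over all pairwise distances.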
📝 Abstract
Instance selection (IS) is important in machine learning for reducing dataset size while preserving key characteristics. Current IS methods often struggle to capture complex relationships in high-dimensional spaces and to scale to large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that enables efficient similarity computation through random projections. The mini-batch approach preserves class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings show that the distance-based mini-batch approach offers an optimal balance of efficiency and effectiveness for large-scale datasets, while multi-view variants provide superior performance on complex, high-dimensional data. This demonstrates that attention-based importance scoring can identify instances crucial for maintaining decision boundaries without requiring exhaustive pairwise comparisons.
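The single-level case of hashing via random projections can be sketched as sign-based bucketing: instances are projected onto random hyperplanes, and the resulting bit pattern groups similar instances into shared buckets without pairwise comparisons. The function below is a hypothetical minimal sketch, not the paper's hierarchical scheme; `n_bits` and the bucket encoding are illustrative assumptions.

```python
import numpy as np

def random_projection_hash(X, n_bits=8, seed=0):
    """Map each row of X to a bucket id via sign-based random projections.

    Nearby instances tend to land in the same bucket, so similarity
    need only be computed within buckets rather than over all pairs.
    """
    rng = np.random.default_rng(seed)
    # One random hyperplane per hash bit
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0
    # Pack each row's sign pattern into a single integer bucket id
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))
```

A multi-level variant would rehash large buckets with fresh hyperplanes, and a multi-view variant would hash disjoint feature subsets separately and combine the bucket assignments.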