Recovering Imbalanced Clusters via Gradient-Based Projection Pursuit

📅 2025-02-04
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This paper addresses the recovery of hidden structure, either imbalanced clusters or a Bernoulli-Rademacher distribution, from high-dimensional data when only a few samples are available. It proposes a projection pursuit method that optimizes the projection index with a gradient-based technique. On the theory side, a sample complexity analysis in the planted vector setting shows that imbalanced clusters are *easier* to recover than balanced ones. A generalized result extends the analysis to a variety of data distributions and projection indices, and the resulting bounds are compared with computational lower bounds from the low-degree polynomial framework. Experiments on FashionMNIST and the Human Activity Recognition dataset show the method outperforming the compared approaches when only a few samples are available.

📝 Abstract
Projection Pursuit is a classic exploratory technique for finding interesting projections of a dataset. We propose a method for recovering projections containing either Imbalanced Clusters or a Bernoulli-Rademacher distribution using a gradient-based technique to optimize the projection index. As sample complexity is a major limiting factor in Projection Pursuit, we analyze our algorithm's sample complexity within a Planted Vector setting where we can observe that Imbalanced Clusters can be recovered more easily than balanced ones. Additionally, we give a generalized result that works for a variety of data distributions and projection indices. We compare these results to computational lower bounds in the Low-Degree-Polynomial Framework. Finally, we experimentally evaluate our method's applicability to real-world data using FashionMNIST and the Human Activity Recognition Dataset, where our algorithm outperforms others when only a few samples are available.
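To make the planted vector setting concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' algorithm. It assumes data that are standard Gaussian except along one hidden unit direction `u`, where the projection follows a zero-mean, unit-variance two-point distribution with minority probability `p`. The empirical third moment serves as a stand-in projection index (the abstract does not say which index the paper uses), and it is maximized by projected gradient ascent on the unit sphere with random restarts. The names `sample_planted`, `skew_index`, and `pursue`, and all parameter values, are hypothetical.

```python
import numpy as np


def sample_planted(n, d, p, rng):
    """Draw n points in R^d that are Gaussian except along a hidden direction u.

    Along u the data take two values (minority probability p), scaled so the
    projection has zero mean and unit variance. Assumed toy model.
    """
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    a = np.sqrt((1 - p) / p)       # minority cluster location
    b = -np.sqrt(p / (1 - p))      # majority cluster location
    nu = np.where(rng.random(n) < p, a, b)
    G = rng.standard_normal((n, d))
    # Replace the component of each Gaussian sample along u with nu.
    X = G + np.outer(nu - G @ u, u)
    return X, u


def skew_index(X, w):
    """Stand-in projection index: empirical third moment of the projection."""
    return np.mean((X @ w) ** 3)


def pursue(X, rng, restarts=20, steps=300, lr=0.1):
    """Projected gradient ascent of skew_index on the unit sphere."""
    n, d = X.shape
    best_w, best_val = None, -np.inf
    for _ in range(restarts):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        for _ in range(steps):
            z = X @ w
            grad = 3.0 * (X.T @ z**2) / n   # gradient of mean((Xw)^3)
            grad -= (grad @ w) * w          # keep only the tangential part
            w = w + lr * grad
            w /= np.linalg.norm(w)          # retract back to the sphere
        val = skew_index(X, w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val
```

For example, after `X, u = sample_planted(5000, 10, 0.1, np.random.default_rng(0))` and `w, val = pursue(X, np.random.default_rng(1))`, the alignment `abs(w @ u)` indicates how well the hidden direction was recovered. A balanced split (`p = 0.5`) has zero skewness, so this particular index carries no signal there and a different index would be needed.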
Problem

Research questions and friction points this paper is trying to address.

Recovering imbalanced clusters with gradient-based projection pursuit
Analyzing sample complexity in the Planted Vector setting
Evaluating the method on real-world datasets when only a few samples are available
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-based projection index optimization
Recovery of Imbalanced Clusters
Sample complexity analysis in the Planted Vector setting
Martin Eppert
Technical University of Munich
Machine Learning, Statistics
Satyaki Mukherjee
National University of Singapore
Probability, Random Matrix Theory, Machine Learning
D. Ghoshdastidar
Technical University of Munich, School of Computation, Information and Technology - I7, Boltzmannstr. 3, 85748 Garching b. München, Germany