🤖 AI Summary
This paper addresses three key challenges in machine learning data preprocessing: high memory overhead, significant privacy leakage risk, and difficulty preserving structural information. To tackle these, we propose a novel Bloom filter–based encoding method that maps raw samples into compact, irreversible binary vectors. This work is the first to systematically validate Bloom filters as a general-purpose, privacy-enhancing feature transformation technique. Empirical evaluation across six heterogeneous datasets—using XGBoost, DNNs, CNNs, and logistic regression—demonstrates classification accuracy comparable to that achieved on raw data (average degradation below 1.2%), while reducing memory consumption by up to 87%. Crucially, the original features are provably unrecoverable, thereby eliminating reconstruction-based privacy threats. Our core contribution lies in establishing both the theoretical applicability and empirical effectiveness of Bloom filters for lightweight, privacy-preserving preprocessing.
📝 Abstract
We present a method that uses the Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact, privacy-preserving bit array. This reduces memory use and protects the original data while keeping enough structure for accurate classification. We test the method on six datasets: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are used: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve accuracy similar to models trained on raw data or other transforms. At the same time, the method provides memory savings while enhancing privacy. These results suggest that the Bloom filter transform is an efficient preprocessing approach for diverse machine learning tasks.
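The paper does not spell out its encoding parameters here, but the core idea of the Bloom filter transform can be sketched as follows: each sample is tokenized, every token is hashed by k independent hash functions, and the corresponding bits of an m-bit array are set. The function name `bloom_encode` and the parameters `m=256`, `k=3` below are illustrative assumptions, not the paper's configuration; the hash functions are derived here by seeding SHA-256.

```python
import hashlib


def bloom_encode(tokens, m=256, k=3):
    """Encode one sample's tokens into an m-bit Bloom filter vector.

    Illustrative sketch: m and k are assumed values, and the k hash
    functions are simulated by prefixing SHA-256 with a seed. The
    mapping is many-to-one, so the original tokens cannot be
    recovered from the bit vector.
    """
    bits = [0] * m
    for token in tokens:
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1  # set one bit per (seed, token)
    return bits


# Example: an SMS-style sample becomes a fixed-length binary feature vector
# suitable as input to XGBoost, a DNN, or logistic regression.
vec = bloom_encode(["free", "winner", "cash"])
```

Because the encoding is deterministic, identical samples map to identical vectors, so classifiers can still learn from set-membership structure; with 3 tokens and k = 3, at most 9 of the 256 bits are set, which is the source of the memory savings reported above.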