dreaMLearning: Data Compression Assisted Machine Learning

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deep learning relies heavily on large-scale labeled datasets and high computational resources, resulting in prohibitive training costs and deployment constraints. To address this, the authors propose dreaMLearning, an end-to-end machine learning framework that operates directly on losslessly compressed data. Built on EntroGeDe, an entropy-driven generalized deduplication method, dreaMLearning performs feature extraction, model training, and inference on compressed data—without decompression. The framework supports multiple data types (tabular and image), diverse tasks (e.g., classification and regression), and heterogeneous models (including neural networks and classical learners). Experiments show negligible accuracy degradation while accelerating training by up to 8.8×, reducing memory footprint by 10×, and cutting storage requirements by 42%. By breaking the conventional "decompress-then-learn" paradigm, dreaMLearning enables efficient, low-overhead machine learning—particularly beneficial for federated learning, edge AI, and distributed training.

📝 Abstract
Despite rapid advancements, machine learning, particularly deep learning, is hindered by the need for large amounts of labeled data to learn meaningful patterns without overfitting and immense demands for computation and storage, which motivate research into architectures that can achieve good performance with fewer resources. This paper introduces dreaMLearning, a novel framework that enables learning from compressed data without decompression, built upon Entropy-based Generalized Deduplication (EntroGeDe), an entropy-driven lossless compression method that consolidates information into a compact set of representative samples. DreaMLearning accommodates a wide range of data types, tasks, and model architectures. Extensive experiments on regression and classification tasks with tabular and image data demonstrate that dreaMLearning accelerates training by up to 8.8x, reduces memory usage by 10x, and cuts storage by 42%, with a minimal impact on model performance. These advancements enhance diverse ML applications, including distributed and federated learning, and tinyML on resource-constrained edge devices, unlocking new possibilities for efficient and scalable learning.
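The abstract's key mechanism is generalized deduplication: each sample is split into a "base" (the portion likely to repeat across samples) and a "deviation" (the remainder), and only unique bases are stored, yielding a compact set of representative values that a learner can consume directly. The sketch below illustrates this split-and-deduplicate idea in minimal form; the fixed high/low-bit split on 8-bit values, the function names, and the parameter `base_bits` are illustrative assumptions, not the paper's actual EntroGeDe algorithm (which selects the split in an entropy-driven way).

```python
# Minimal sketch of generalized deduplication (GD) on 8-bit values.
# Assumption: a fixed split into high-order "base" bits and low-order
# "deviation" bits; EntroGeDe chooses this split entropy-adaptively.

def gd_compress(samples, base_bits=4):
    """Split each 8-bit value into base (high bits) and deviation
    (low bits); deduplicate bases across samples (lossless)."""
    bases = {}    # unique base -> integer id
    encoded = []  # per-sample (base_id, deviation)
    dev_bits = 8 - base_bits
    for x in samples:
        base = x >> dev_bits                 # high bits, often shared
        dev = x & ((1 << dev_bits) - 1)      # low bits, kept per sample
        base_id = bases.setdefault(base, len(bases))
        encoded.append((base_id, dev))
    return list(bases), encoded

def gd_decompress(base_list, encoded, base_bits=4):
    """Exactly reconstruct the original samples (losslessness check)."""
    dev_bits = 8 - base_bits
    return [(base_list[i] << dev_bits) | d for i, d in encoded]

samples = [0x12, 0x17, 0x1A, 0x83, 0x85]
bases, enc = gd_compress(samples)
assert gd_decompress(bases, enc) == samples  # lossless round-trip
assert len(bases) == 2                       # two bases represent five samples
```

Here two unique bases stand in for five samples; a model trained on the base set (weighted by how many samples each base represents) sees a compact proxy for the full dataset, which is the intuition behind training without decompression.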
Problem

Research questions and friction points this paper is trying to address.

How to learn directly from compressed data, when conventional pipelines require decompression first
Prohibitive computational and storage demands of deep learning on large labeled datasets
Inefficiency of ML pipelines for resource-constrained settings such as edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning from losslessly compressed data without decompression
EntroGeDe, an entropy-driven generalized deduplication method that consolidates data into representative samples
Up to 8.8x faster training, 10x lower memory usage, and 42% less storage with minimal accuracy loss