Morphing-based Compression for Data-centric ML Pipelines

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lossless matrix compression methods fail to effectively capture structural redundancies introduced during data cleaning, augmentation, and feature engineering, leading to suboptimal efficiency in data-centric ML pipelines. This paper introduces BWARE, the first framework to deeply embed lossless compression within the outer loop of data engineering—enabling end-to-end co-design of compression and feature transformation. Its key contributions are: (1) a workload-aware compression mechanism supporting lightweight, on-the-fly morphing without decompression; (2) a column correlation- and sparsity-aware morphing strategy; and (3) direct feature transformation in the compressed domain. Experiments demonstrate that end-to-end training time drops from days to hours, while memory utilization, I/O behavior across the storage-memory-cache hierarchy, and instruction-level parallelism all improve.
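To make the idea of operating "directly on the compressed representation" concrete, here is a minimal sketch (not the paper's implementation) of dense dictionary coding (DDC), one common lightweight scheme: each column stores its distinct values once plus a small per-row code, so a scaling operation touches only the dictionary instead of every cell.

```python
import numpy as np

def compress_ddc(column):
    """Dense dictionary coding (DDC): keep the distinct values once,
    plus a per-row integer code referencing the dictionary."""
    dictionary, codes = np.unique(column, return_inverse=True)
    return dictionary, codes

def scale_compressed(dictionary, codes, alpha):
    """Scale a compressed column without decompressing:
    O(#distinct) work on the dictionary, then a gather,
    instead of O(#rows) multiplications on the dense column."""
    scaled_dict = dictionary * alpha
    return scaled_dict[codes]

# A column with heavy value repetition, as data cleaning and
# augmentation in the outer loop tend to produce.
col = np.array([3.0, 3.0, 7.0, 3.0, 7.0, 7.0, 3.0])
d, c = compress_ddc(col)
assert np.allclose(scale_compressed(d, c, 2.0), col * 2.0)
```

The same gather trick generalizes to matrix-vector products: the small dictionary is combined with the vector entry once per distinct value, which is where the memory and instruction-parallelism benefits cited above come from.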

📝 Abstract
Data-centric ML pipelines extend traditional machine learning (ML) pipelines -- of feature transformations and ML model training -- by outer loops for data cleaning, augmentation, and feature engineering to create high-quality input data. Existing lossless matrix compression applies lightweight compression schemes to numeric matrices and performs linear algebra operations such as matrix-vector multiplications directly on the compressed representation but struggles to efficiently rediscover structural data redundancy. Compressed operations are effective at fitting data in available memory, reducing I/O across the storage-memory-cache hierarchy, and improving instruction parallelism. The applied data cleaning, augmentation, and feature transformations provide a rich source of information about data characteristics such as distinct items, column sparsity, and column correlations. In this paper, we introduce BWARE -- an extension of AWARE for workload-aware lossless matrix compression -- that pushes compression through feature transformations and engineering to leverage information about structural transformations. Besides compressed feature transformations, we introduce a novel technique for lightweight morphing of a compressed representation into workload-optimized compressed representations without decompression. BWARE shows substantial end-to-end runtime improvements, reducing the execution time for training data-centric ML pipelines from days to hours.
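The abstract's "compressed feature transformations" can be illustrated with a toy example (an assumption for illustration, not BWARE's actual code): one-hot encoding a dictionary-coded column only replaces the small dictionary with an identity matrix, while the per-row codes stay untouched.

```python
import numpy as np

def one_hot_compressed(dictionary, codes):
    """One-hot encode a dictionary-coded column without decompressing:
    the per-row codes are reused as-is, and only the (small) dictionary
    is swapped for an identity matrix -- O(#distinct^2), not O(#rows)."""
    new_dictionary = np.eye(len(dictionary))
    return new_dictionary, codes

dictionary = np.array(["a", "b", "c"])
codes = np.array([0, 2, 1, 0, 2])
oh_dict, oh_codes = one_hot_compressed(dictionary, codes)
dense = oh_dict[oh_codes]  # materialize only to verify the result
assert dense.shape == (5, 3)
assert dense[1, 2] == 1.0  # row 1 held "c"
```

This is why pushing compression through feature transformations pays off: the transformation's cost scales with the number of distinct items rather than the number of rows.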
Problem

Research questions and friction points this paper is trying to address.

Efficiently compressing data-centric ML pipeline matrices
Leveraging structural transformations for lossless compression
Reducing ML pipeline runtime via workload-optimized compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends AWARE for workload-aware matrix compression
Lightweight morphing without decompression for optimization
Leverages structural transformations for efficient compression
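The "lightweight morphing without decompression" bullet can be sketched as follows, assuming dictionary-coded (DDC) input morphed into run-length encoding (RLE); this is a hypothetical illustration of the concept, not the paper's algorithm. The run detection operates on the compact code array, so the dense column is never reconstructed.

```python
import numpy as np

def morph_ddc_to_rle(dictionary, codes):
    """Morph a DDC column into run-length encoding by scanning the
    small integer code array only -- the value dictionary is reused
    as-is, so the dense column is never materialized."""
    run_breaks = np.flatnonzero(np.diff(codes)) + 1
    starts = np.concatenate(([0], run_breaks))
    lengths = np.diff(np.concatenate((starts, [len(codes)])))
    return dictionary, codes[starts], lengths

d = np.array([1.5, 9.0])
codes = np.array([0, 0, 0, 1, 1, 0])
_, run_codes, run_lens = morph_ddc_to_rle(d, codes)
assert run_codes.tolist() == [0, 1, 0]
assert run_lens.tolist() == [3, 2, 1]
```

A workload-aware optimizer would pick the target layout (RLE, sparse, offset lists, etc.) per column group based on the observed operations, which is the decision BWARE's morphing mapping is reported to make.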