EntroGD: Efficient Compression and Accurate Direct Analytics on Compressed Data

📅 2025-11-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the poor scalability of Generalized Deduplication (GD) on high-dimensional data and the prohibitively high O(nd²) time complexity of GreedyGD, this paper proposes EntroGD, a novel entropy-guided two-stage bit-selection framework. First, it constructs fidelity-preserving condensed samples; then, it jointly optimizes dictionary bases and deviations in the bit-representation space, reducing time complexity to O(nd). EntroGD supports both dictionary-based encoding and direct analytical operations (e.g., clustering) in the compressed domain, achieving high compression ratios with negligible accuracy loss. Extensive experiments across 18 datasets demonstrate that EntroGD accelerates configuration by 53.5× over GreedyGD and clustering by 31.6× over analytics on the original data, while matching the compression quality of state-of-the-art methods and incurring negligible precision degradation.

πŸ“ Abstract
Generalized Deduplication (GD) enables lossless compression with direct analytics on compressed data by dividing data into \emph{bases} and \emph{deviations} and performing dictionary encoding on the former. However, GD algorithms face scalability challenges for high-dimensional data. For example, the GreedyGD algorithm relies on an iterative bit-selection process across $d$-dimensional data, resulting in $O(nd^2)$ complexity for $n$ data rows to select the bits used as bases and deviations. Although the number of data rows $n$ can be reduced during training at the expense of performance, high-dimensional data still suffers a marked loss in performance. This paper introduces EntroGD, an entropy-guided GD framework that reduces the complexity of the bit-selection algorithm to $O(nd)$. EntroGD follows a two-step process. First, it generates condensed samples to preserve analytic fidelity. Second, it applies entropy-guided bit selection to maximize compression efficiency. Across 18 datasets of varying types and dimensionalities, EntroGD achieves compression performance comparable to GD-based and universal compressors, while reducing configuration time by up to 53.5$\times$ over GreedyGD and accelerating clustering by up to 31.6$\times$ over the original data with negligible accuracy loss by performing analytics on the condensed samples, which are far fewer than the original samples. Thus, EntroGD provides an efficient and scalable solution for performing analytics directly on compressed data.
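The base/deviation split described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it assumes 8-bit values and a fixed bit mask, whereas GD algorithms such as GreedyGD and EntroGD select the base bits adaptively per dataset.

```python
def gd_compress(values, base_mask=0xF0):
    """Toy Generalized Deduplication: split each 8-bit value into a
    base (masked bits, dictionary-encoded) and a deviation (remaining
    bits, stored verbatim). The fixed base_mask is an illustrative
    assumption, not the paper's bit-selection method."""
    dictionary = {}          # base pattern -> dictionary index
    encoded = []             # per-value (dict_index, deviation) pairs
    for v in values:
        base = v & base_mask
        deviation = v & ~base_mask & 0xFF
        idx = dictionary.setdefault(base, len(dictionary))
        encoded.append((idx, deviation))
    return dictionary, encoded

def gd_decompress(dictionary, encoded):
    """Lossless reconstruction: reattach each deviation to its base."""
    bases = {idx: base for base, idx in dictionary.items()}
    return [bases[idx] | dev for idx, dev in encoded]

data = [0x12, 0x15, 0x17, 0x33, 0x34]
dictionary, encoded = gd_compress(data)
assert gd_decompress(dictionary, encoded) == data  # lossless round trip
```

Because similar values share a base, the dictionary stays small (here, two entries for five values), which is where the compression gain comes from, and analytics can operate on the dictionary indices without full decompression.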
Problem

Research questions and friction points this paper is trying to address.

Scalability challenges in Generalized Deduplication for high-dimensional data compression
High computational complexity of existing GD algorithms limiting practical applications
Performance degradation when handling high-dimensional datasets during compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-guided bit selection reduces complexity to O(nd)
Generates condensed samples to preserve analytic fidelity
Enables direct analytics on compressed data efficiently
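The entropy-guided selection idea behind the first bullet can be sketched as a single O(nd) pass: estimate the empirical entropy of each bit position, then treat low-entropy (highly repetitive) positions as base bits and high-entropy positions as deviation bits. The threshold rule below is a hypothetical stand-in for the paper's actual criterion.

```python
import math

def bit_entropies(rows, width=8):
    """One pass over n rows of `width`-bit integers: empirical binary
    entropy of each bit position, O(n * width) total."""
    n = len(rows)
    ones = [0] * width
    for v in rows:
        for b in range(width):
            ones[b] += (v >> b) & 1
    entropies = []
    for count in ones:
        p = count / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # constant bit: zero entropy
        else:
            entropies.append(-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return entropies

def select_base_bits(rows, width=8, threshold=0.5):
    """Hypothetical rule: low-entropy positions become base bits,
    since near-constant bits deduplicate well in the dictionary."""
    return [b for b, h in enumerate(bit_entropies(rows, width)) if h < threshold]

data = [0x12, 0x15, 0x17, 0x33, 0x34]
base_bits = select_base_bits(data)
```

Unlike GreedyGD's iterative search, each bit position is scored once, which is how the per-bit entropy heuristic avoids the O(nd²) cost.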
Xiaobo Zhao
Postdoc, Aarhus University
Data Compression · Edge Intelligence · IoT · Network Coding · Distributed Storage Systems
D. Lucani
DIGIT, Department of Electrical and Computer Engineering, Aarhus University