PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing learned lossless compressors suffer from low compression ratios, limited throughput, and poor robustness on genomic data. To address these bottlenecks, this paper proposes and implements an efficient compression framework based on parallel multi-knowledge learning. Our approach innovatively integrates automated multi-knowledge learning, GPU-accelerated (s,k)-mer encoding, dynamic data chunking, and a step-wise model passing mechanism, enabling flexible single- or multi-GPU deployment. Extensive experiments across 15 real-world genomic datasets demonstrate that our method achieves up to a 73.6% improvement in compression ratio and a 10.7× speedup in throughput over 14 state-of-the-art baselines, while exhibiting superior robustness and competitive memory overhead. This work establishes a scalable, high-performance deep learning paradigm for efficient storage and transmission of large-scale genomic databases.

📝 Abstract
Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To address these challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated ($s$,$k$)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to cover complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to the baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036× and 10.710×, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating greater stability against datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
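The abstract mentions an ($s$,$k$)-mer encoder but this summary does not define the term. A minimal sketch, assuming (as is common for genomic tokenization) that an (s,k)-mer is a length-k substring sampled at stride s, with each k-mer packed into an integer token over the {A,C,G,T} alphabet — the function names and the exact definition are our illustration, not the paper's:

```python
def sk_mers(seq, s, k):
    """Extract k-mers at stride s from a DNA string.

    Assumed reading of (s,k)-mers: overlapping when s < k,
    gapped when s > k; the paper's precise scheme may differ.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, s)]


def encode_kmer(kmer):
    """Pack a k-mer over {A,C,G,T} into a base-4 integer token."""
    alphabet = {"A": 0, "C": 1, "G": 2, "T": 3}
    code = 0
    for ch in kmer:
        code = code * 4 + alphabet[ch]
    return code


# Tokenize a toy sequence with s=2, k=4.
tokens = [encode_kmer(m) for m in sk_mers("ACGTACGT", s=2, k=4)]
```

In a learned compressor, such integer tokens would feed the probability model whose predictions drive an arithmetic coder; the GPU acceleration described in the paper would parallelize this tokenization across the sequence.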
Problem

Research questions and friction points this paper is trying to address.

Improves inadequate compression ratio for genomic databases
Enhances low compression and decompression throughput
Addresses poor compression robustness across diverse datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated multi-knowledge learning framework for compression
GPU-accelerated (s,k)-mer encoder for throughput optimization
Data block partitioning and SMP for parallel acceleration
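The data block partitioning idea — split the input into chunks that are compressed in parallel and reassembled losslessly — can be shown with a CPU-only analogue. This sketch substitutes zlib for the paper's learned model purely to make the chunk-parallel structure concrete; all names are ours, not PMKLC's API:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, n_chunks: int = 4) -> list[bytes]:
    """Partition data into n_chunks blocks and compress them in parallel.

    zlib stands in for the learned per-block compressor; PMKLC's
    actual model passing (SMP) between blocks is not modeled here.
    """
    size = -(-len(data) // n_chunks)  # ceiling division
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as ex:
        return list(ex.map(zlib.compress, blocks))


def decompress_chunks(blocks: list[bytes]) -> bytes:
    """Decompress blocks in order and concatenate: losslessness check."""
    return b"".join(zlib.decompress(b) for b in blocks)
```

Because each block is independent here, decompression is also trivially parallel; the paper's Step-wise Model Passing presumably adds coordination so blocks can share learned statistics without serializing the pipeline.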
Hui Sun
College of C.S., Nankai University & College of CCSD, Nanyang Technological University
Yanfeng Ding
Nankai University
AI4Compression, Large Language Models, High-Performance Computing, Bioinformatics
Liping Yi
Tenure-Track Associate Professor, Tianjin University
Federated Learning, LLM, Multi-Agent
Huidong Ma
Nankai University
AI4Compression, Deep Learning, Parallel Computing, Bioinformatics
Gang Wang
College of C.S., Nankai-Baidu Joint Lab, TMCC, SysNet, DISSec, GTIISC, Nankai University
Xiaoguang Liu
College of C.S., Nankai-Baidu Joint Lab, TMCC, SysNet, DISSec, GTIISC, Nankai University
Cheng Zhong
School of Computer, Electronics and Information, Guangxi University
Wentong Cai
Professor of Computer Science, Nanyang Technological University
Parallel and Distributed Computing, Modeling and Simulation