Hecate: A Modular Genomic Compressor

📅 2026-03-16
🤖 AI Summary
This work addresses the challenge of simultaneously achieving high compression ratios, fast processing speeds, and efficient random access in lossless compression of genomic data (FASTA/FASTQ). The authors propose a multi-stream conditional coding framework that decomposes the input into coupled streams (control symbols, headers, nucleotides, case information, quality scores, and extras) and codes each stream with its own codec under a shared indexed block container. Key components include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, a blockwise Markov mixture coder, and a referential mode based on streamwise binary differencing. In benchmarks against established tools (MFCompress, NAF, bzip3, AGC), Hecate is 2 to 10 times faster at the same compression ratio and achieves up to 5% to 10% better compression within the same time budget, with notably stronger results on large genomes and high-similarity referential settings, all while supporting exact random-access slicing.
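The stream decomposition described in the summary can be illustrated with a minimal sketch. It assumes plain ASCII FASTA input without a trailing newline. The function names and the layout of the four streams are hypothetical and chosen for illustration only; they are not Hecate's actual container format.

```python
def split_fasta_streams(text):
    """Split FASTA text into four illustrative streams.

    Returns (headers, bases, case_bits, line_lengths). All names and the
    layout are assumptions for this sketch, not Hecate's real format.
    """
    headers = []       # header text without the leading '>'
    bases = []         # residues folded to upper case
    case_bits = []     # True where the original residue was lower case
    line_lengths = []  # control stream: -1 marks a header, else line length
    for line in text.splitlines():
        if line.startswith(">"):
            headers.append(line[1:])
            line_lengths.append(-1)
        else:
            line_lengths.append(len(line))
            for ch in line:
                bases.append(ch.upper())
                case_bits.append(ch.islower())
    return headers, "".join(bases), case_bits, line_lengths


def join_fasta_streams(headers, bases, case_bits, line_lengths):
    """Rebuild the original FASTA text from the four streams."""
    out = []
    next_header = iter(headers)
    pos = 0
    for n in line_lengths:
        if n < 0:
            out.append(">" + next(next_header))
        else:
            chunk = bases[pos:pos + n]
            mask = case_bits[pos:pos + n]
            out.append("".join(c.lower() if low else c
                               for c, low in zip(chunk, mask)))
            pos += n
    return "\n".join(out)


text = ">seq1 demo\nACGTacgt\nNNac\n>seq2\nGGCC"
headers, bases, case_bits, line_lengths = split_fasta_streams(text)
restored = join_fasta_streams(headers, bases, case_bits, line_lengths)
```

Once separated, each stream has much more regular statistics than the interleaved file (for example, the case stream is typically long runs of identical bits), which is what makes per-stream codecs attractive.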

📝 Abstract
We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and a referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression-versus-speed trade-offs among established state-of-the-art tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and in high-similarity referential settings. For the same compression ratio, Hecate is 2 to 10 times faster. When given the same time budget as other algorithms, Hecate achieves up to 5% to 10% better compression.
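As a rough illustration of alphabet-aware packing with a side channel, the sketch below stores A/C/G/T at 2 bits per symbol and records every other residue separately as a (position, symbol) pair. The names `pack_bases` and `unpack_bases`, the bit order, and the exception-list layout are assumptions made for this example; the abstract does not specify how Hecate encodes either channel.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq):
    """Pack A/C/G/T at 2 bits each; route anything else to a side channel.

    Hypothetical layout: four symbols per byte, lowest bits first.
    Returns (packed_bytes, exceptions, length).
    """
    packed = bytearray((len(seq) + 3) // 4)
    exceptions = []  # side channel: (position, original symbol)
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            exceptions.append((i, ch))
            code = 0  # placeholder in the main stream
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), exceptions, len(seq)


def unpack_bases(packed, exceptions, length):
    """Invert pack_bases: decode 2-bit codes, then patch in the exceptions."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for i, ch in exceptions:
        out[i] = ch
    return "".join(out)


seq = "ACGTNNACGRT"
packed, exceptions, length = pack_bases(seq)
restored = unpack_bases(packed, exceptions, length)
```

The design point is that the main stream stays at a fixed 2 bits per base even when the input contains ambiguity codes such as N or R, and the cost of those rare symbols is confined to the side channel, which can be compressed separately.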
Problem

Research questions and friction points this paper is trying to address.

genomic compression
lossless compression
FASTA/FASTQ
compression ratio
random access
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular genomic compression
conditional coding
indexed block container
alphabet-aware packing
Markov mixture coder
Kamila Szewczyk
Algorithmic Bioinformatics, Saarland University, Saarbrücken, Germany; Center for Bioinformatics, Saarland Informatics Campus, Germany
Sven Rahmann
Center for Bioinformatics Saar and Saarland Informatics Campus, Saarland University
Algorithmic Bioinformatics, Sequence Analysis, Hashing, Filters, Combinatorial Optimization