Hecate: A Modular Genomic Compressor

📅 2026-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of simultaneously achieving high compression ratios, fast processing, and efficient random access in lossless compression of genomic data (FASTA/FASTQ). The authors propose a multi-stream conditional coding framework that decomposes the input into coupled substreams (control symbols, headers, nucleotides, case information, quality scores, and extras) and encodes each with its own codec inside a shared indexed block container. Key components include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, a blockwise Markov mixture coder, and a referential mode based on streamwise binary differencing. In the reported benchmarks against MFCompress, NAF, bzip3, and AGC, Hecate is 2 to 10 times faster at the same compression ratio, or compresses 5% to 10% better within the same time budget, with notably stronger results on large genomes and high-similarity referential settings, while supporting exact random-access slicing.
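
The stream decomposition described above can be illustrated with a minimal Python sketch. It is not Hecate's actual format: the function names `split_streams` and `join_streams`, the run-length representation of case, and the use of per-line lengths as the "control" stream are all assumptions made for illustration. The sketch also assumes ASCII FASTA input that starts with a header line and uses `\n` line endings with a trailing newline.

```python
def split_streams(fasta_text):
    """Split FASTA text into header, layout, base, and case streams (illustrative only)."""
    headers, layout, seq_lines = [], [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            headers.append(line[1:])      # header stream, without the '>' marker
            layout.append([])             # control stream: line lengths per record
        else:
            layout[-1].append(len(line))
            seq_lines.append(line)
    raw = "".join(seq_lines)
    bases = raw.upper()                   # base stream carries no case information
    # Case stream: run lengths alternating upper, lower, upper, ...
    case_runs, in_lower, run = [], False, 0
    for ch in raw:
        if ch.islower() != in_lower:
            case_runs.append(run)
            in_lower, run = not in_lower, 0
        run += 1
    case_runs.append(run)
    return headers, layout, bases, case_runs


def join_streams(headers, layout, bases, case_runs):
    """Inverse of split_streams: rebuild the original FASTA text exactly."""
    parts, pos, in_lower = [], 0, False
    for run in case_runs:
        segment = bases[pos:pos + run]
        parts.append(segment.lower() if in_lower else segment)
        pos += run
        in_lower = not in_lower
    raw = "".join(parts)
    lines, pos = [], 0
    for header, lengths in zip(headers, layout):
        lines.append(">" + header)
        for n in lengths:
            lines.append(raw[pos:pos + n])
            pos += n
    return "\n".join(lines) + "\n"
```

Once separated, each stream has much more regular statistics than the interleaved file (for example, the base stream is mostly drawn from a four-letter alphabet), which is what makes per-stream codecs attractive.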

📝 Abstract

We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and a referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression-versus-speed trade-offs against established state-of-the-art tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and high-similarity referential settings. For the same compression ratio, Hecate is 2 to 10 times faster. When given the same time budget as other algorithms, Hecate achieves 5% to 10% better compression.
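
As a rough illustration of alphabet-aware packing with a side channel for out-of-alphabet residues, the following Python sketch packs A/C/G/T at 2 bits per base and records every other symbol separately as a (position, symbol) pair. The abstract does not specify Hecate's bit layout or side-channel encoding, so the names `pack_bases` and `unpack_bases`, the low-bits-first layout, and the placeholder code 0 are assumptions, not the authors' implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
SYMBOL = "ACGT"


def pack_bases(seq):
    """Pack an upper-case sequence at 2 bits per base, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    exceptions = []                        # side channel: (position, original symbol)
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:                   # out-of-alphabet residue, e.g. N or R
            exceptions.append((i, ch))
            code = 0                       # placeholder, overwritten on decode
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), exceptions, len(seq)


def unpack_bases(packed, exceptions, length):
    """Inverse of pack_bases: decode the 2-bit codes, then patch in the exceptions."""
    out = [SYMBOL[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)
```

For example, `pack_bases("ACGTNNACGR")` produces 3 packed bytes plus three side-channel entries for the two `N` residues and the `R`. The design keeps the common case at a fixed 2 bits per base, so rare symbols cost extra only where they occur.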

Problem

Research questions and friction points this paper is trying to address.

genomic compression

lossless compression

FASTA/FASTQ

compression ratio

random access

Innovation

Methods, ideas, or system contributions that make the work stand out.

modular genomic compression

conditional coding

indexed block container