🤖 AI Summary
This work addresses the challenge of achieving high compression ratios, fast processing, and efficient random access at the same time in lossless compression of genomic data (FASTA/FASTQ). The authors propose Hecate, a multi-stream conditional coding framework that decomposes the input into coupled substreams (control symbols, headers, nucleotides, case information, quality scores, and extras) and assigns each a modular codec under a shared indexed block container. Key components include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler transform (BWT) pipeline with custom arithmetic coding, a blockwise Markov mixture coder, and a referential mode based on streamwise binary differencing. In the reported benchmarks against established tools (MFCompress, NAF, bzip3, AGC), Hecate is 2 to 10 times faster at the same compression ratio and achieves up to 5% to 10% better compression under the same time budget, with notably stronger results on large genomes and in high-similarity referential settings, while supporting exact random-access slicing.
📝 Abstract
We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression vs. speed trade-offs against state-of-the-art established tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and high-similarity referential settings. For the same compression ratio, Hecate is 2 to 10 times faster. When given the same time budget as other algorithms, Hecate achieves up to 5% to 10% better compression.
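To make the idea of alphabet-aware packing with an explicit side channel concrete, here is a minimal Python sketch. It is not Hecate's actual format or API: the function names `pack` and `unpack`, the 2-bit code assignment, and the `(position, symbol)` exception list are all assumptions chosen for illustration. The sketch stores A/C/G/T at 2 bits per base and records any other residue (for example `N`) out of band, so the packed stream stays dense while the round trip remains lossless.

```python
from typing import List, Tuple

# Hypothetical 2-bit code assignment; the real codec may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq: str) -> Tuple[bytes, List[Tuple[int, str]], int]:
    """Pack an uppercase sequence at 2 bits per base.

    Symbols outside ACGT go to a side channel as (position, symbol)
    pairs, and a placeholder code 0 is written in their slot.
    """
    packed = bytearray((len(seq) + 3) // 4)
    exceptions: List[Tuple[int, str]] = []
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            exceptions.append((i, ch))  # out-of-alphabet residue
            code = 0                    # placeholder in the packed stream
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), exceptions, len(seq)


def unpack(packed: bytes, exceptions: List[Tuple[int, str]], length: int) -> str:
    """Invert pack(): decode 2-bit codes, then patch in side-channel symbols."""
    out = [BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)


if __name__ == "__main__":
    packed, exceptions, n = pack("ACGTNNAC")
    print(len(packed), exceptions)
    print(unpack(packed, exceptions, n))
```

In this sketch the packed bytes and the exception list are separate outputs, which mirrors the abstract's point that each stream can be handed to its own codec. A production encoder would presumably also compress the exception list (for instance as runs), which is omitted here.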