🤖 AI Summary
This work proposes FASTR, a compute-native, lossless alternative format for high-throughput sequencing FASTQ data, which addresses the high storage and transmission costs stemming from data redundancy. FASTR uniquely encodes each base and its corresponding quality score into a single 8-bit value, representing reads compactly as image-inspired vectors. The format is fully lossless, reversible, and machine learning–friendly, enabling direct downstream analysis—such as with minimap2—without requiring explicit decompression. Experimental results demonstrate that FASTR reduces raw file size by at least twofold; when combined with general-purpose compressors, it achieves higher compression ratios and accelerates compression by 1.75–4.8× and decompression by 1.75–2.34× compared to existing methods, substantially speeding up real-time genomic analysis pipelines.
📝 Abstract
Motivation High-throughput sequencing (HTS) enables population-scale genomics but generates massive datasets, creating bottlenecks in storage, transfer, and analysis. FASTQ, the standard format for over two decades, stores one byte per base and one byte per quality score, leading to inefficient I/O, high storage costs, and redundancy. Existing compression tools can mitigate some issues, but often introduce costly decompression or complex dependency issues. Results We introduce FASTR, a lossless, computation-native successor to FASTQ that encodes each nucleotide together with its base quality score into a single 8-bit value. FASTR reduces file size by at least 2× while remaining fully reversible and directly usable for downstream analyses. Applying general-purpose compression tools on FASTR consistently yields higher compression ratios, 2.47, 3.64, and 4.8× faster compression, and 2.34, 1.96, 1.75× faster decompression than on FASTQ across Illumina, HiFi, and ONT reads. FASTR is machine-learning-ready, allowing reads to be consumed directly as numerical vectors or image-like representations. We provide a highly parallel software ecosystem for FASTQ–FASTR conversion and show that FASTR integrates with existing tools, such as minimap2, with minimal interface changes and no performance overhead. By eliminating decompression costs and reducing data movement, FASTR lays the foundation for scalable genomics analyses and real-time sequencing workflows. Availability and Implementation https://github.com/ALSER-Lab/FASTR