FASTR: Reimagining FASTQ via Compact Image-inspired Representation

📅 2026-01-23
🏛️ bioRxiv
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes FASTR, a computation-native, lossless alternative to the FASTQ format for high-throughput sequencing data, addressing the high storage and transfer costs caused by FASTQ's redundant encoding. FASTR encodes each base together with its quality score in a single 8-bit value, so reads can be handled as compact numerical vectors or image-like representations. The format is fully reversible and machine-learning-ready, and it works with downstream tools such as minimap2 with minimal interface changes and without a separate decompression step. FASTR reduces file size by at least 2×. Applying general-purpose compressors to FASTR rather than FASTQ yields higher compression ratios, 2.47–4.8× faster compression, and 1.75–2.34× faster decompression across Illumina, HiFi, and ONT reads, which speeds up real-time genomic analysis pipelines.

📝 Abstract
Motivation: High-throughput sequencing (HTS) enables population-scale genomics but generates massive datasets, creating bottlenecks in storage, transfer, and analysis. FASTQ, the standard format for over two decades, stores one byte per base and one byte per quality score, leading to inefficient I/O, high storage costs, and redundancy. Existing compression tools can mitigate some of these issues, but they often introduce costly decompression steps or complex dependencies.

Results: We introduce FASTR, a lossless, computation-native successor to FASTQ that encodes each nucleotide together with its base quality score into a single 8-bit value. FASTR reduces file size by at least 2× while remaining fully reversible and directly usable for downstream analyses. Applying general-purpose compression tools to FASTR rather than FASTQ consistently yields higher compression ratios, 2.47×, 3.64×, and 4.8× faster compression, and 2.34×, 1.96×, and 1.75× faster decompression on Illumina, HiFi, and ONT reads, respectively. FASTR is machine-learning-ready, allowing reads to be consumed directly as numerical vectors or image-like representations. We provide a highly parallel software ecosystem for FASTQ–FASTR conversion and show that FASTR integrates with existing tools, such as minimap2, with minimal interface changes and no performance overhead. By eliminating decompression costs and reducing data movement, FASTR lays the foundation for scalable genomics analyses and real-time sequencing workflows.

Availability and Implementation: https://github.com/ALSER-Lab/FASTR
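To make the "one 8-bit value per base and quality score" idea concrete, here is a minimal sketch of one possible packing. It is not the FASTR encoding itself, which the abstract does not specify. The bit layout (2 high bits for the base, 6 low bits for the Phred score), the restriction to A/C/G/T with scores 0–63, and the names `pack_read` and `unpack_read` are all assumptions made for illustration.

```python
# Illustrative only: an ASSUMED layout, not the published FASTR specification.
# High 2 bits = base code (A, C, G, T); low 6 bits = Phred score (0..63).
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def pack_read(seq: str, qual: str, offset: int = 33) -> bytes:
    """Pack a read and its Phred+33 quality string into one byte per base."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    packed = bytearray(len(seq))
    for i, (base, qchar) in enumerate(zip(seq, qual)):
        score = ord(qchar) - offset
        # This sketch rejects N and scores above 63; a real format must
        # handle them, and how FASTR does so is not stated in the abstract.
        if base not in CODE or not 0 <= score < 64:
            raise ValueError(f"cannot pack {base!r} with score {score}")
        packed[i] = (CODE[base] << 6) | score
    return bytes(packed)


def unpack_read(packed: bytes, offset: int = 33) -> tuple[str, str]:
    """Invert pack_read, recovering the sequence and quality strings."""
    seq = "".join(BASES[value >> 6] for value in packed)
    qual = "".join(chr((value & 0x3F) + offset) for value in packed)
    return seq, qual


if __name__ == "__main__":
    seq, qual = "ACGT", "II5!"
    packed = pack_read(seq, qual)
    print(len(seq) + len(qual), "bytes ->", len(packed), "bytes")
    assert unpack_read(packed) == (seq, qual)
```

Under this assumed layout the packed read is half the size of the separate sequence and quality strings, and the resulting byte values (0–255) can be treated like a row of 8-bit pixel intensities, which is one way to read the "image-inspired" description.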
Problem

Research questions and friction points this paper is trying to address.

FASTQ
high-throughput sequencing
data compression
storage bottleneck
genomic data format
Innovation

Methods, ideas, or system contributions that make the work stand out.

FASTR
lossless compression
image-inspired representation
genomic data format
machine-learning-ready
Authors
Adrian Tkachenko
ALSER Lab, Computational Life Sciences, Georgia State University, Atlanta, GA 30303, USA; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Sepehr Salem
ALSER Lab, Computational Life Sciences, Georgia State University, Atlanta, GA 30303, USA; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
A. E. Adeniyi
Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Zülal Bingöl
Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
Mohammed Nayeem Uddin
ALSER Lab, Computational Life Sciences, Georgia State University, Atlanta, GA 30303, USA; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Akshat Prasanna
ALSER Lab, Computational Life Sciences, Georgia State University, Atlanta, GA 30303, USA
Alexander Zelikovsky
Distinguished University Professor at Georgia State University
algorithms, computational genomics, bioinformatics, VLSI CAD, molecular epidemiology
S. Mangul
Sage Bionetworks, Seattle, WA, USA; Department of Biological and Morphofunctional Sciences, College of Medicine and Biological Sciences, Stefan cel Mare University of Suceava, 720229 Suceava, Romania
Can Alkan
Bilkent University, Ankara, Turkey
computational biology, genomics, bioinformatics, computer science, algorithms
M. Alser
ALSER Lab, Computational Life Sciences, Georgia State University, Atlanta, GA 30303, USA; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; Department of Biological and Morphofunctional Sciences, College of Medicine and Biological Sciences, Stefan cel Mare University of Suceava, 720229 Suceava, Romania