Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN

📅 2025-07-22
🤖 AI Summary
Problem: As DNA sequence data grows explosively, traditional heuristic sequence alignment methods scale poorly, incur high computational cost, and struggle to detect distantly related sequences. Method: We propose a scalable analysis framework that uses bioinformatics-specific pretrained gene embedding models to map sequences to low-dimensional vectors, combined with FAISS and ScaNN for efficient approximate nearest neighbor search. Contributions/Results: (1) Our approach decouples homology inference from explicit sequence similarity, significantly improving detection of novel taxonomic groups and orphan genes lacking homologs; (2) it achieves comparable accuracy in discriminating functional similarity while outperforming mainstream tools in retrieval speed and memory efficiency. Experimental evaluation demonstrates its practicality and scalability for large-scale genome functional annotation and homology detection.

📝 Abstract
The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences. In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.
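As a rough illustration of the embedding-based search the abstract describes, the sketch below retrieves the most cosine-similar vectors for a query. It is a minimal example under stated assumptions, not the paper's pipeline: the embeddings are random stand-ins rather than real gene embeddings, the names `normalize` and `top_k` are hypothetical, and it uses an exact FAISS inner-product index (`IndexFlatIP`) when the `faiss` package is installed, falling back to an equivalent NumPy computation otherwise. Approximate indexes in FAISS or ScaNN would replace the exact step to trade a little accuracy for speed and memory.

```python
import numpy as np

try:
    import faiss  # optional dependency, typically installed as "faiss-cpu"
except ImportError:
    faiss = None


def normalize(x):
    """Scale rows to unit length so inner product equals cosine similarity."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def top_k(database, queries, k):
    """Return (scores, indices) of the k most cosine-similar database rows per query."""
    db, q = normalize(database), normalize(queries)
    if faiss is not None:
        index = faiss.IndexFlatIP(db.shape[1])  # exact inner-product index
        index.add(db)
        return index.search(q, k)
    # Fallback: the same exact search written in plain NumPy.
    sims = q @ db.T
    idx = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, idx, axis=1), idx


# Synthetic stand-in for gene embeddings (assumption: 1000 genes, 64 dimensions).
rng = np.random.default_rng(0)
gene_embeddings = rng.standard_normal((1000, 64)).astype(np.float32)

# Query with a slightly perturbed copy of gene 42; it should come back as the top hit.
query = gene_embeddings[[42]] + 0.01 * rng.standard_normal((1, 64)).astype(np.float32)
scores, ids = top_k(gene_embeddings, query, k=5)
```

Normalizing first means the inner-product scores can be read directly as cosine similarities, which keeps the FAISS and NumPy paths interchangeable.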
Problem

Research questions and friction points this paper is trying to address.

Addressing scalability issues in DNA sequence similarity search
Improving computational efficiency for large-scale gene embedding search
Enhancing detection of novel sequences with divergent homology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses FAISS and ScaNN for gene embedding search
Focuses the evaluation on bioinformatics-specific embeddings
Improves retrieval quality and computational efficiency
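One common way to quantify the retrieval quality of an approximate index is recall@k against an exact search. This page does not state which metric the paper uses, so the sketch below is an assumption for illustration only; the function name `recall_at_k` and the neighbor IDs are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k neighbors that the approximate search also returned."""
    if len(approx_ids) != len(exact_ids):
        raise ValueError("need one result list per query in both inputs")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)


# Hypothetical neighbor IDs for two queries (not data from the paper).
exact = [[7, 3, 9], [1, 4, 8]]
approx = [[7, 9, 2], [1, 4, 8]]
score = recall_at_k(approx, exact, k=3)
```

Here the first query recovers two of its three true neighbors and the second recovers all three, so the score averages those two fractions.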
Mohammad Saleh Refahi
Research Assistant at Drexel University
Machine learning, Deep Learning, Computational biology
Gavin Hearne
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, USA
Harrison Muller
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, USA
Kieran Lynch
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, USA
Bahrad A. Sokhansanj
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, USA
James R. Brown
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, USA
Gail Rosen
Professor of ECE, Drexel University
Bioinformatics, Metagenomics, Genomic Signal Processing