U-index: A Universal Indexing Framework for Matching Long Patterns

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing index structures for long-pattern matching struggle to achieve space efficiency and fast queries at the same time. Method: This paper proposes a generic text indexing framework based on a “sketch-then-verify” paradigm: lightweight sketches (e.g., min-hash) of the text and the query rapidly filter candidate positions, which are then verified by cache-friendly scans over short substrings of the original text. The framework decouples the indexing logic from the underlying data structure, so suffix arrays, FM-indexes, and r-indexes can be plugged in to index the sketched text. Contribution/Results: It is the first to unify three critical properties: minimal auxiliary space overhead, full retention of the original text, and efficient long-pattern matching, with a theoretical guarantee of constant-time verification per occurrence. Experiments show index construction accelerated several-fold, space consumption reduced to 1/3–1/5 of that of conventional methods, and significant end-to-end speedups in bioinformatics tasks such as long-read alignment.

📝 Abstract
Text indexing is a fundamental and well-studied problem. Classic solutions either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy (an index) to accelerate matching. The former solutions thus retain excellent compressed space, but are slow in practice. The latter approaches, like the suffix array, instead sacrifice space for speed. We show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so demonstrate applications in long-read mapping.
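
The sketch-then-verify idea described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' implementation: it assumes window minimizers as the sketching scheme, uses a naively built suffix array as a stand-in for whichever index is plugged in over the sketched text, and the names `minimizers`, `SketchIndex`, `k`, and `w` are hypothetical. It also shows why a lower bound on pattern length is needed: a pattern shorter than `k + w - 1` characters contains no complete window and therefore has no sketch.

```python
def minimizers(s, k, w):
    """Return [(position, kmer)] for the minimizer of every window of w
    consecutive k-mers in s (leftmost k-mer wins ties)."""
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        # Minimizer positions never move left as the window slides,
        # so comparing with the last pick is enough to deduplicate.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


class SketchIndex:
    """Toy sketch-then-verify index (illustrative assumptions only)."""

    def __init__(self, text, k=3, w=4):
        self.text, self.k, self.w = text, k, w
        mins = minimizers(text, k, w)
        self.pos = [p for p, _ in mins]      # text position of each minimizer
        self.sketch = [m for _, m in mins]   # the sketched text
        # Naive suffix array over the sketch; any index over the
        # sketched text could take its place.
        self.sa = sorted(range(len(self.sketch)), key=lambda i: self.sketch[i:])

    def find(self, pattern):
        if len(pattern) < self.k + self.w - 1:
            raise ValueError("pattern shorter than one minimizer window")
        pmins = minimizers(pattern, self.k, self.w)
        offset = pmins[0][0]                 # first minimizer's offset in pattern
        q = [m for _, m in pmins]
        # Binary search for the first suffix whose prefix is >= q.
        lo, hi = 0, len(self.sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.sketch[self.sa[mid]:self.sa[mid] + len(q)] < q:
                lo = mid + 1
            else:
                hi = mid
        hits = []
        while lo < len(self.sa) and self.sketch[self.sa[lo]:self.sa[lo] + len(q)] == q:
            cand = self.pos[self.sa[lo]] - offset   # candidate text position
            # Verification step: compare against the original text.
            if cand >= 0 and self.text[cand:cand + len(pattern)] == pattern:
                hits.append(cand)
            lo += 1
        return sorted(hits)


if __name__ == "__main__":
    index = SketchIndex("GATTACAGATTACACATTAGGATTACA", k=3, w=4)
    print(index.find("GATTACA"))
```

In this toy setting the suffix array is built over the minimizer sequence, which is shorter than the text itself, and each sketch match is confirmed by a single short comparison against the original text. This mirrors the two points the abstract attributes the savings to: a lower bound on pattern length and working in sketch space.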

Problem

Research questions and friction points this paper is trying to address.

Develops a universal indexing framework for long patterns.
Combines sketch-based matching with original text verification.
Optimizes space and speed for long-read mapping applications.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal indexing framework
Sketch-based pattern matching
Efficient long pattern queries