🤖 AI Summary
This work takes a first step toward the biggest open problem for co-lexicographic (co-lex) indexes on nondeterministic finite automata (NFAs): building the index in near-linear time. It proves that the coarsest forward-stable co-lex (CFS) order, on which the most efficient known index for arbitrary NFAs is based, can be encoded in $O(|Q|)$ space, where $|Q|$ is the number of states. The most efficient known algorithms for building this index run in time quadratic in the number of transitions, which is infeasible for very large graphs such as pangenome graphs. A near-linear-time algorithm is known only for deterministic finite automata (DFAs), where it was enabled by a linear-space encoding. The new result supplies the analogous linear-space encoding for NFAs; a near-linear-time construction algorithm for NFAs remains open.
📝 Abstract
The Burrows-Wheeler transform (BWT) is a string transformation that enhances string indexing and compressibility. Cotumaccio and Prezza [SODA '21] extended this transformation to nondeterministic finite automata (NFAs) through co-lexicographic partial orders, i.e., by sorting the states of an NFA according to the co-lexicographic order of the strings reaching them. As the BWT of an NFA shares many properties with its original string variant, the transformation can be used to implement indices for locating specific patterns on the NFA itself. The efficiency of the resulting index is influenced by the width of the partial order on the states: the smaller the width, the faster the index. The most efficient index for arbitrary NFAs currently known in the literature is based on the coarsest forward-stable co-lex (CFS) order of Becker et al. [SPIRE '24]. In this paper, we prove that this CFS order can be encoded in space linear in the number of states of the automaton. The importance of this result stems from the fact that encoding such an order in linear space is a major first step toward building the index based on this order in near-linear time, which is the biggest open research question in this context. The most efficient algorithms currently known for this task run in time quadratic in the number of transitions of the NFA and are thus infeasible to run on very large graphs (e.g., pangenome graphs). At this point, a near-linear-time algorithm is known only for the simpler case of deterministic automata [Becker et al., ESA '23] and, in fact, this algorithmic result was enabled by a linear-space encoding for deterministic automata [Kim et al., CPM '23].
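To make the notion of co-lexicographic sorting concrete, the following minimal sketch treats a plain string as a path-shaped automaton in which state i is reached by exactly one string, the prefix of length i. This is only the simplest special case and is not the construction from the paper: in a general NFA a state can be reached by many strings, so the order is only partial. The function names `colex_key`, `colex_sort_prefixes`, and `outgoing_labels` are hypothetical and chosen for illustration.

```python
def colex_key(s: str) -> str:
    # Co-lex order compares strings from right to left,
    # so the reversed string serves as an ordinary sort key.
    return s[::-1]


def colex_sort_prefixes(text: str) -> list[int]:
    # State i of the path automaton is reached by the prefix text[:i].
    # Return the state indices sorted by the co-lex order of that prefix.
    return sorted(range(len(text) + 1), key=lambda i: colex_key(text[:i]))


def outgoing_labels(text: str, order: list[int]) -> str:
    # Read the label of the edge leaving each state, in sorted state order.
    # '$' is an assumed end marker for the last state, which has no outgoing edge.
    return "".join(text[i] if i < len(text) else "$" for i in order)


if __name__ == "__main__":
    order = colex_sort_prefixes("banana")
    print(order)                             # [0, 2, 4, 6, 1, 3, 5]
    print(outgoing_labels("banana", order))  # bnn$aaa
```

On a path automaton every pair of states is comparable, so the order has width 1; the indices discussed in the abstract generalize this setting to automata whose co-lex order has larger width.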