AI Summary
Substructure search over large-scale JSONL datasets incurs prohibitive computational overhead with conventional tree traversal and subtree matching, hindering real-time applications such as foundation model prompt engineering.
Method: We propose an efficient indexing and search framework: (1) modeling multi-object JSONL as a fused tree structure; (2) introducing an xBWT-based compact representation enabling fast path decomposition and ancestor relationship inference; and (3) designing a three-stage adaptive search algorithm that avoids exhaustive traversal.
Contribution/Results: Experiments on real-world datasets demonstrate speedups of up to 4,700× over traditional tree search and more than 6 million× over XML-based processing, while maintaining a competitive memory footprint. The core innovations are an xBWT-driven compressed tree representation and an adaptive substructure localization mechanism, which jointly enable scalable, low-latency JSONL substructure search.
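To make the fused (merged) tree idea concrete, the following is a minimal Python sketch, not the jXBW implementation. It merges the trees of several JSONL objects into one pointer-based tree and records, at every node, the identifiers of the objects that pass through it. The names `Node`, `insert`, `build_merged_tree`, and `lookup`, the `"[]"` label for array elements, and the use of 0-based line numbers as identifiers are all assumptions made for illustration. The sketch uses plain dictionaries, whereas the paper replaces such a tree with a succinct XBW-based structure.

```python
import json


class Node:
    """One node of the merged tree (hypothetical helper class)."""

    def __init__(self):
        self.children = {}  # edge label -> Node
        self.ids = set()    # identifiers of the objects that reach this node


def insert(node, value, obj_id):
    """Merge one parsed JSON value into the tree below `node`."""
    node.ids.add(obj_id)
    if isinstance(value, dict):
        for key, child in value.items():
            insert(node.children.setdefault(key, Node()), child, obj_id)
    elif isinstance(value, list):
        # Assumption: array positions are ignored, so all elements share "[]".
        for child in value:
            insert(node.children.setdefault("[]", Node()), child, obj_id)
    else:
        # Scalars become leaves labelled with their JSON text, e.g. '"ann"'.
        leaf = node.children.setdefault(json.dumps(value), Node())
        leaf.ids.add(obj_id)


def build_merged_tree(jsonl_text):
    """Build one tree from all non-empty lines; ids are 0-based line numbers."""
    root = Node()
    for obj_id, line in enumerate(jsonl_text.splitlines()):
        if line.strip():
            insert(root, json.loads(line), obj_id)
    return root


def lookup(root, path):
    """Return the ids of all objects that contain the root-to-node path."""
    node = root
    for label in path:
        node = node.children.get(label)
        if node is None:
            return set()
    return node.ids


data = "\n".join([
    '{"user": {"name": "ann", "tags": ["a", "b"]}}',
    '{"user": {"name": "bob", "tags": ["b"]}}',
    '{"user": {"name": "ann"}}',
])
tree = build_merged_tree(data)
print(lookup(tree, ("user", "name", '"ann"')))        # objects 0 and 2
print(lookup(tree, ("user", "tags", "[]", '"b"')))    # objects 0 and 1
```

Because shared prefixes such as `user` and `user/name` are stored once, a single path lookup answers the question for every object at the same time, while the per-node identifier sets preserve which objects contributed each path.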
Abstract
Substructure search in JSON Lines (JSONL) datasets is essential for modern applications such as prompt engineering in foundation models, but existing methods suffer from prohibitive computational costs due to exhaustive tree traversal and subtree matching. We present jXBW, a fast method for substructure search on large-scale JSONL datasets. Our method makes three key technical contributions: (i) a merged tree representation built by merging trees of multiple JSON objects while preserving individual identities, (ii) a succinct data structure based on the eXtended Burrows-Wheeler Transform that enables efficient tree navigation and subpath search, and (iii) an efficient three-step substructure search algorithm that combines path decomposition, ancestor computation, and adaptive tree identifier collection to ensure correctness while avoiding exhaustive tree traversal. Experimental evaluation on real-world datasets demonstrates that jXBW consistently outperforms existing methods, achieving speedups of 16$\times$ for smaller datasets and up to 4,700$\times$ for larger datasets over tree-based approaches, and of more than $6 \times 10^6$ over XML-based processing, while maintaining competitive memory usage.
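The sketch below illustrates, in generic terms, why matching the decomposed paths of a query is not sufficient on its own, and what the exhaustive per-object baseline looks like. It is not the jXBW algorithm: the abstract does not spell out the ancestor computation or the identifier collection step, so an exact recursive check stands in for them here. The helper names (`paths`, `path_filter`, `contains`, `search`) and the containment semantics (query rooted at the top of each object, array order ignored) are assumptions.

```python
import json


def paths(value, prefix=()):
    """Decompose a parsed JSON value into its root-to-leaf label paths."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths(child, prefix + (key,))
    elif isinstance(value, list):
        for child in value:
            yield from paths(child, prefix + ("[]",))
    else:
        yield prefix + (json.dumps(value),)


def path_filter(target, query):
    """Necessary condition only: every query path also occurs in the target."""
    return set(paths(query)) <= set(paths(target))


def contains(target, query):
    """Exact recursive check that `query` is a substructure of `target`."""
    if isinstance(query, dict):
        return isinstance(target, dict) and all(
            key in target and contains(target[key], sub)
            for key, sub in query.items()
        )
    if isinstance(query, list):
        # Each query element must match some element of the target array.
        return isinstance(target, list) and all(
            any(contains(elem, sub) for elem in target) for sub in query
        )
    return type(target) is type(query) and target == query


def search(jsonl_text, query):
    """Exhaustive baseline: test every object, using the path filter first."""
    hits = []
    for obj_id, line in enumerate(jsonl_text.splitlines()):
        if line.strip():
            obj = json.loads(line)
            if path_filter(obj, query) and contains(obj, query):
                hits.append(obj_id)
    return hits


data = "\n".join([
    '{"items": [{"a": 1, "b": 2}]}',
    '{"items": [{"a": 1}, {"b": 2}]}',
    '{"items": [{"a": 1}]}',
])
query = {"items": [{"a": 1, "b": 2}]}
objects = [json.loads(line) for line in data.splitlines()]
print([i for i, obj in enumerate(objects) if path_filter(obj, query)])
print(search(data, query))
```

The second object contains both query paths, but under different array elements, so the path filter alone accepts it while the exact check rejects it. A correct search therefore has to confirm that the matched paths meet at common ancestors, and this baseline does so by visiting every object, which is the cost an index is meant to avoid.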