jXBW: Fast Substructure Search in Large-Scale JSONL Datasets for Foundation Model Applications

📅 2025-08-17

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Substructure search over large-scale JSONL datasets incurs prohibitive computational overhead with conventional tree traversal and subtree matching, hindering real-time applications such as foundation model prompt engineering. Method: The authors propose an efficient indexing and search framework that (1) models a multi-object JSONL dataset as a single merged tree while preserving the identity of each object; (2) introduces an xBWT-based compact representation that supports fast tree navigation and subpath search; and (3) designs a three-step search algorithm, combining path decomposition, ancestor computation, and adaptive tree identifier collection, that avoids exhaustive traversal. Contribution/Results: Experiments on real-world datasets show speedups of 16× on smaller datasets and up to 4,700× on larger datasets over tree-based search, and more than 6-million× over XML-based processing, while maintaining a competitive memory footprint. The core innovations are an xBWT-driven compressed tree representation and an adaptive substructure localization mechanism, which together enable scalable, low-latency JSONL substructure search.


📝 Abstract

Substructure search in JSON Lines (JSONL) datasets is essential for modern applications such as prompt engineering in foundation models, but existing methods suffer from prohibitive computational costs due to exhaustive tree traversal and subtree matching. We present jXBW, a fast method for substructure search on large-scale JSONL datasets. Our method makes three key technical contributions: (i) a merged tree representation built by merging trees of multiple JSON objects while preserving individual identities, (ii) a succinct data structure based on the eXtended Burrows-Wheeler Transform that enables efficient tree navigation and subpath search, and (iii) an efficient three-step substructure search algorithm that combines path decomposition, ancestor computation, and adaptive tree identifier collection to ensure correctness while avoiding exhaustive tree traversal. Experimental evaluation on real-world datasets demonstrates that jXBW consistently outperforms existing methods, achieving speedups of 16$\times$ for smaller datasets and up to 4,700$\times$ for larger datasets over tree-based approaches, and more than 6$\times$10$^6$ over XML-based processing while maintaining competitive memory usage.

Problem

Research questions and friction points this paper is trying to address.

Efficient substructure search in large JSONL datasets

Reduce computational costs of tree traversal and matching

Optimize search for foundation model applications
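
To make the cost of exhaustive traversal concrete, here is a minimal sketch of a naive baseline: parse every line and test each object against the query recursively. The function names (`contains`, `naive_search`) and the matching semantics (query rooted at the top of each object, array order ignored) are assumptions chosen for illustration, not definitions taken from the paper.

```python
import json

def contains(obj, query):
    """Return True if `query` occurs as a substructure rooted at `obj`.

    Assumed semantics: every key of a query object must be present with a
    matching value; every element of a query array must match some element
    of the data array (order is ignored).
    """
    if isinstance(query, dict):
        return isinstance(obj, dict) and all(
            k in obj and contains(obj[k], v) for k, v in query.items()
        )
    if isinstance(query, list):
        return isinstance(obj, list) and all(
            any(contains(o, q) for o in obj) for q in query
        )
    # Scalars must agree in both type and value (so True does not match 1).
    return type(obj) is type(query) and obj == query

def naive_search(jsonl_text, query):
    """Scan every JSONL line; return the 0-based line indices that match."""
    hits = []
    for line_no, line in enumerate(jsonl_text.splitlines()):
        if line.strip() and contains(json.loads(line), query):
            hits.append(line_no)
    return hits

sample = "\n".join([
    '{"a": {"b": 1}, "c": [1, 2]}',
    '{"a": {"b": 2}}',
    '{"a": {"b": 1, "d": 3}}',
])
print(naive_search(sample, {"a": {"b": 1}}))
```

Every query touches every object in the file, so the work grows with the total dataset size regardless of how selective the query is. That is the cost an index-based approach aims to avoid.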

Innovation

Methods, ideas, or system contributions that make the work stand out.

Merged tree representation for multiple JSON objects

Succinct data structure with eXtended Burrows-Wheeler Transform

Three-step substructure search algorithm for efficiency
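
The first and third ideas above can be illustrated with a simplified sketch: merge all objects into one label trie whose nodes remember which objects pass through them, then answer a query by decomposing it into root-to-leaf paths and intersecting the object ids found at the end of each path. This uses a plain pointer-based trie, not the succinct eXtended Burrows-Wheeler Transform structure the paper describes, and all names (`Node`, `paths`, `build_merged_tree`, `candidates`) are hypothetical.

```python
import json

ARRAY = ("array",)  # edge label shared by all array elements (an assumption)

class Node:
    def __init__(self):
        self.children = {}  # edge label -> Node
        self.ids = set()    # ids of the objects whose paths reach this node

def paths(value, prefix=()):
    """Yield the root-to-leaf label paths of a JSON value."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from paths(child, prefix + (key,))
    elif isinstance(value, list) and value:
        for child in value:
            yield from paths(child, prefix + (ARRAY,))
    else:
        # Leaf labels are tuples so they can never collide with string keys.
        yield prefix + (("leaf", json.dumps(value)),)

def build_merged_tree(objects):
    """Merge all objects into one trie, keeping per-node object ids."""
    root = Node()
    for obj_id, obj in enumerate(objects):
        for path in paths(obj):
            node = root
            node.ids.add(obj_id)
            for label in path:
                node = node.children.setdefault(label, Node())
                node.ids.add(obj_id)
    return root

def candidates(root, query):
    """Intersect the id sets reached by each root-to-leaf path of the query."""
    result = None
    for path in paths(query):
        node = root
        for label in path:
            node = node.children.get(label)
            if node is None:
                return set()  # some query path occurs in no object
        result = set(node.ids) if result is None else result & node.ids
    return result if result is not None else set()

objects = [
    {"a": {"b": 1}, "c": [1, 2]},
    {"a": {"b": 2}},
    {"a": {"b": 1, "d": 3}},
]
root = build_merged_tree(objects)
print(candidates(root, {"a": {"b": 1}}))
```

Lookup cost here depends on the query's paths rather than on the number of objects. Note that path intersection alone yields candidates, not guaranteed matches: two query paths that pass through the same array may be satisfied by different array elements of the same object. Ruling out such cases requires an additional check on shared ancestors, which this sketch omits.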
