Contextual Tokenization for Graph Inverted Indices

📅 2025-10-25

🤖 AI Summary

To make subgraph-isomorphism retrieval over large graph corpora efficient, this paper proposes CORGII, a differentiable inverted-index framework. It maps contextual dense graph representations to discrete tokens, which the authors describe as, to their knowledge, the first text-like indexing of dense graph representations. The framework combines context-aware graph encoding, differentiable discretization into sparse binary codes over a learned latent vocabulary, data-driven trainable impact weights, and multi-probe token expansion, so that it supports soft matching while trading accuracy against efficiency. In experiments on multiple benchmarks, the method outperforms existing baselines, reaching high recall at substantially lower retrieval latency and therefore a better accuracy-efficiency trade-off.

📝 Abstract

Retrieving graphs from a large corpus that contain a subgraph isomorphic to a given query graph is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text-document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a "token" on a graph (such as TF-IDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency trade-offs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
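The pipeline the abstract describes (dense representation, then discrete tokens, then inverted lists) can be illustrated with a small sketch. This is not the CORGII implementation: the function names, the fixed 0.5 threshold standing in for the learned differentiable discretization, and the choice to treat each active vocabulary entry as one token are all assumptions made for illustration.

```python
from collections import defaultdict

def discretize(node_scores, threshold=0.5):
    """Turn per-node scores over a latent vocabulary into a set of token ids.

    Assumption: a hard threshold replaces the paper's learned,
    differentiable discretization module.
    """
    tokens = set()
    for scores in node_scores:
        tokens.update(i for i, s in enumerate(scores) if s > threshold)
    return tokens

def build_index(corpus_tokens):
    """Map each token id to the set of graph ids whose code contains it."""
    index = defaultdict(set)
    for gid, tokens in corpus_tokens.items():
        for t in tokens:
            index[t].add(gid)
    return index

def retrieve(index, query_tokens, min_coverage=1.0):
    """Return graphs covering at least `min_coverage` of the query tokens.

    Only the posting lists of the query tokens are touched, so graphs that
    share no token with the query are never scored.
    """
    hits = defaultdict(int)
    for t in query_tokens:
        for gid in index.get(t, ()):
            hits[gid] += 1
    need = min_coverage * len(query_tokens)
    return sorted(g for g, c in hits.items() if c >= need)

# Toy corpus: two graphs, vocabulary of size 4 (hypothetical scores).
corpus_scores = {
    "g1": [[0.9, 0.1, 0.8, 0.2], [0.1, 0.7, 0.1, 0.1]],
    "g2": [[0.9, 0.2, 0.1, 0.1]],
}
corpus_tokens = {gid: discretize(s) for gid, s in corpus_scores.items()}
index = build_index(corpus_tokens)
query_tokens = discretize([[0.8, 0.0, 0.9, 0.0]])
shortlist = retrieve(index, query_tokens)
```

In a full system the shortlist would then be re-scored with the accurate (and expensive) set-containment scorer. Lowering `min_coverage` admits graphs that match only part of the query's tokens, a crude stand-in for the soft containment scores mentioned in the abstract.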

Problem

Research questions and friction points this paper is trying to address.

Retrieving graphs containing isomorphic subgraphs from large corpora

Overcoming exhaustive scoring limitations in graph similarity search

Enabling efficient inverted indexing for dense graph representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

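Contextual dense graph representations discretized into sparse binary codes over a learned latent vocabulary

Trainable, data-driven impact weights replacing fixed token weights such as TF-IDF or BM25

Token expansion for multi-probing the index, giving smoother accuracy-efficiency trade-offs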
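To make the last two ideas concrete, here is a hedged sketch of impact-weighted scoring with multi-probe token expansion. The summary does not say how CORGII selects expansion tokens or learns its weights, so this example swaps in a classic scheme: tokens are integer bit codes, expansion probes every code within a small Hamming radius, and the impact weights are hand-set numbers standing in for trained values. The names `expand`, `score`, and `postings` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def expand(token, n_bits, radius=1):
    """All codes within Hamming distance `radius` of `token`, itself included."""
    probes = [token]
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = token
            for b in bits:
                flipped ^= 1 << b  # flip one chosen bit
            probes.append(flipped)
    return probes

def score(postings, query_tokens, n_bits, radius=0):
    """Sum, per graph, the best impact weight found for each query token.

    `postings` maps token -> {graph_id: impact_weight}. With radius=0 only
    exact token matches count; a larger radius probes more posting lists.
    """
    totals = defaultdict(float)
    for q in query_tokens:
        best = {}  # best weight per graph among this token's probes
        for p in expand(q, n_bits, radius):
            for gid, w in postings.get(p, {}).items():
                best[gid] = max(best.get(gid, 0.0), w)
        for gid, w in best.items():
            totals[gid] += w
    return dict(totals)

# Toy postings with made-up impact weights (stand-ins for trained values).
postings = {
    0b101: {"g1": 0.75},
    0b100: {"g2": 0.25},
    0b011: {"g1": 0.5, "g2": 1.0},
}
query = [0b101, 0b011]
exact = score(postings, query, n_bits=3, radius=0)
probed = score(postings, query, n_bits=3, radius=1)
```

With `radius=0`, graph `g2` gets credit only for the token it matches exactly. With `radius=1`, the probe for `0b101` also reaches `0b100`, so `g2` picks up additional weight. The radius therefore acts as the knob between fewer posting-list lookups and higher recall.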

🔎 Similar Papers

Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers