🤖 AI Summary
To address the low efficiency of subgraph isomorphism retrieval in large-scale graph databases, this paper proposes CORGII, a differentiable inverted-index framework. It maps dense graph representations to discrete tokens, which the authors describe as the first text-like indexing of dense graph representations, and combines context-aware graph encoding, differentiable discretization into sparse binary codes over a learned latent vocabulary, data-driven trainable impact weights, and multi-probe token expansion to support soft matching and joint optimization of accuracy and efficiency. Experiments on multiple benchmarks show that the method offers better accuracy-efficiency trade-offs than several baselines, maintaining retrieval quality while reducing retrieval latency.
📝 Abstract
Retrieving graphs from a large corpus that contain a subgraph isomorphic to a given query graph is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text-document-like representation allows us to leverage classic, highly optimized inverted indices while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a 'token' on a graph (such as TF-IDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency trade-offs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency than several baselines.
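The pipeline the abstract describes (binary codes as tokens, inverted lists with impact weights, multi-probe expansion) can be illustrated with a toy sketch. This is not the CORGII implementation: the hard sign threshold stands in for the learned differentiable discretization, the uniform impact of 1.0 stands in for the trained impact weight, Hamming-radius-1 bit flips are one assumed form of token expansion, and all function names (`binarize`, `build_index`, `expand`, `retrieve`) are hypothetical.

```python
from collections import defaultdict

import numpy as np


def binarize(embeddings, threshold=0.0):
    """Hard-threshold node embeddings into binary codes.
    Assumption: a stand-in for a learned, differentiable discretizer."""
    return (np.asarray(embeddings) > threshold).astype(np.uint8)


def code_to_token(code):
    """Pack one binary code into an integer token id."""
    return int("".join(str(int(b)) for b in code), 2)


def graph_tokens(embeddings):
    """Set of discrete tokens for one graph (one token per node)."""
    return {code_to_token(c) for c in binarize(embeddings)}


def build_index(corpus):
    """corpus: dict graph_id -> (n_nodes, n_bits) array.
    Returns an inverted index: token -> {graph_id: impact weight}."""
    index = defaultdict(dict)
    for gid, emb in corpus.items():
        for tok in graph_tokens(emb):
            index[tok][gid] = 1.0  # uniform impact; a trained weight would go here
    return index


def expand(token, n_bits, radius=0):
    """Multi-probe expansion: the token itself, plus (for radius >= 1)
    every token at Hamming distance 1. Larger radii are not handled."""
    probes = {token}
    if radius >= 1:
        probes |= {token ^ (1 << i) for i in range(n_bits)}
    return probes


def retrieve(index, query_emb, n_bits, radius=0):
    """Score candidates by summing, per query token, the best impact
    found among that token's probes. Only graphs on a probed inverted
    list are touched, so the corpus is never scanned exhaustively."""
    scores = defaultdict(float)
    for tok in graph_tokens(query_emb):
        best = {}
        for probe in expand(tok, n_bits, radius):
            for gid, w in index.get(probe, {}).items():
                best[gid] = max(best.get(gid, 0.0), w)
        for gid, w in best.items():
            scores[gid] += w
    # Highest score first; ties broken by graph id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


corpus = {
    "g1": np.array([[1.0, 1.0], [1.0, -1.0]]),
    "g2": np.array([[-1.0, -1.0]]),
}
index = build_index(corpus)
print(retrieve(index, np.array([[1.0, 1.0]]), n_bits=2))
print(retrieve(index, np.array([[-1.0, 1.0]]), n_bits=2, radius=1))
```

In this toy setup, raising `radius` trades extra inverted-list lookups for a larger candidate set, which is the kind of accuracy-efficiency knob that multi-probing is meant to expose. A real system would typically re-rank the shortlisted candidates with the full soft containment score.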