🤖 AI Summary
This work addresses the limitations of traditional Euclidean distance–based embedding retrieval, which fails to capture complex semantic paths and local geometric structure in citation graphs. The authors propose a geometry-aware semantic retrieval method that learns a low-rank Riemannian metric at each node, inducing local positive semi-definite metric spaces. Coarse-grained retrieval runs a geodesic distance–driven, multi-source Dijkstra algorithm that yields path-interpretable results; fine-grained reranking then applies Maximal Marginal Relevance (MMR) and path-consistency filtering. The key innovation is integrating learnable node-level Riemannian metrics into citation retrieval, combined with a hierarchical search strategy that substantially reduces computational overhead. On a benchmark of 169,000 papers, the method improves Recall@20 by 23% over SPECTER+FAISS, cuts computational cost by 75% through hierarchical search, and retains 97% of retrieval quality.
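The per-node metric construction the summary describes (a low-rank factor plus a scaled identity, guaranteeing a valid positive-definite metric) can be sketched in a few lines of NumPy. This is a minimal illustration of the parameterization, not the authors' implementation; the function names, dimensions, and `eps` value are assumptions:

```python
import numpy as np

def local_metric(L, eps=1e-3):
    """Build the local metric G = L @ L.T + eps * I from a low-rank
    factor L of shape (d, r). The eps * I term makes G strictly
    positive definite even when r < d."""
    d = L.shape[0]
    return L @ L.T + eps * np.eye(d)

def local_distance(x, y, G):
    """Mahalanobis-style distance between embeddings x and y under
    the local metric G: sqrt((x - y)^T G (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ G @ diff))

rng = np.random.default_rng(0)
d, r = 8, 2                      # illustrative sizes, not from the paper
L = rng.normal(size=(d, r))
G = local_metric(L)
x, y = rng.normal(size=d), rng.normal(size=d)
dist = local_distance(x, y, G)   # non-negative; zero only when x == y
```

In the full system these per-node distances would serve as edge weights for the geodesic (Dijkstra) search stage.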
📝 Abstract
We present Geodesic Semantic Search (GSS), a retrieval system that learns node-specific Riemannian metrics on citation graphs to enable geometry-aware semantic search. Unlike standard embedding-based retrieval that relies on fixed Euclidean distances, GSS learns a low-rank metric factor $\mathbf{L}_i \in \mathbb{R}^{d \times r}$ at each node, inducing a local positive semi-definite metric $\mathbf{G}_i = \mathbf{L}_i \mathbf{L}_i^\top + \epsilon \mathbf{I}$. This parameterization guarantees valid metrics while keeping the model tractable. Retrieval proceeds via multi-source Dijkstra on the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path-coherence filtering. On citation prediction benchmarks with 169K papers, GSS achieves a 23% relative improvement in Recall@20 over SPECTER+FAISS baselines while providing interpretable citation paths. Our hierarchical coarse-to-fine search with k-means pooling reduces computational cost by $4\times$ compared to flat geodesic search while maintaining 97% retrieval quality. We provide theoretical analysis of when geodesic distances outperform direct similarity, characterize the approximation quality of low-rank metrics, and validate the predictions empirically. Code and trained models are available at https://github.com/YCRG-Labs/geodesic-search.
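The Maximal Marginal Relevance reranking stage mentioned in the abstract follows a standard greedy scheme: at each step, pick the candidate that best trades off relevance to the query against redundancy with already-selected items. The sketch below is the generic MMR formulation, not the authors' implementation; the trade-off parameter `lam` and the similarity inputs are illustrative assumptions:

```python
import numpy as np

def mmr_rerank(query_sim, doc_sims, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    query_sim: (n,) similarity of each candidate to the query.
    doc_sims:  (n, n) pairwise similarities between candidates.
    Returns indices of the k selected candidates, in selection order.
    """
    n = len(query_sim)
    selected, remaining = [], set(range(n))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy = max similarity to anything already chosen.
            redundancy = max((doc_sims[i, j] for j in selected), default=0.0)
            score = lam * query_sim[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam = 1.0` this reduces to plain relevance ranking; lower values increasingly penalize near-duplicate results, which is the intended diversification effect in the fine-grained stage.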