🤖 AI Summary
This work addresses keyword dictionary expansion by proposing a semantic neighbourhood discovery method that combines manifold learning with graph diffusion. Rather than relying on direct pairwise similarity or co-occurrence statistics, the approach first constructs a local similarity graph from the geometric structure of word embeddings. It then applies heat-kernel-based graph diffusion to capture nonlinear semantic connectivity, and uses local community detection to identify cohesive, interpretable semantic neighbourhoods around the seed keywords. The key contribution is the joint use of manifold learning and localized graph diffusion to model the intrinsic nonlinear structure of word embeddings. The method is evaluated on two user-generated English-language corpora and on a real-world communication science use case, the expansion of a conspiracy-related dictionary, where it improves on baselines based on direct word similarities or co-occurrences. Expert assessment indicates that the expanded dictionary contains more useful keywords for the domain.
📝 Abstract
We present Local Graph-based Dictionary Expansion (LGDE), a method for data-driven discovery of the semantic neighbourhood of words using tools from manifold learning and network science. At the heart of LGDE lies the creation of a word similarity graph from the geometry of word embeddings followed by local community detection based on graph diffusion. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings to capture word similarities based on paths of semantic association, over and above direct pairwise similarities. Exploiting such semantic neighbourhoods enables the expansion of dictionaries of pre-selected keywords, an important step for tasks in information retrieval, such as database queries and online data collection. We validate LGDE on two user-generated English-language corpora and show that LGDE enriches the list of keywords with improved performance relative to methods based on direct word similarities or co-occurrences. We further demonstrate our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on the expansion of a conspiracy-related dictionary from online data collected and analysed by domain experts. Our empirical results and expert user assessment indicate that LGDE expands the seed dictionary with more useful keywords due to the manifold-learning-based similarity network.
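The pipeline described in the abstract (similarity graph from embeddings, then diffusion from a seed word, then a local neighbourhood) can be illustrated with a minimal sketch. This is not the LGDE implementation: the k-nearest-neighbour graph, the lazy random walk, and the top-`size` cut-off are simplifying assumptions standing in for the manifold-learning graph construction and the diffusion-based local community detection, and the function names, toy vocabulary, and 2D embeddings are hypothetical.

```python
import numpy as np


def knn_similarity_graph(embeddings, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity.

    Assumption: a plain kNN graph stands in for the manifold-learning
    graph construction used by the actual method.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is never its own neighbour
    n = sim.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]  # indices of the k most similar words
        adj[i, nbrs] = np.clip(sim[i, nbrs], 0.0, None)  # drop negative weights
    return np.maximum(adj, adj.T)  # keep an edge if either endpoint selected it


def diffusion_neighbourhood(adj, seed, steps, size):
    """Rank words by the mass a lazy random walk started at `seed` places on them.

    Assumption: a fixed-size top-ranked set replaces local community detection.
    """
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = adj / deg[:, None]  # row-stochastic transition matrix
    p = np.zeros(adj.shape[0])
    p[seed] = 1.0
    for _ in range(steps):
        p = 0.5 * p + 0.5 * (p @ transition)  # lazy step: stay or move
    ranked = np.argsort(-p)
    return [int(i) for i in ranked if i != seed][:size]


# Hypothetical toy vocabulary: two groups of words placed on a circle.
vocab = ["conspiracy", "plot", "cover-up", "hoax", "bake", "oven", "flour", "recipe"]
angles = np.deg2rad([0, 10, 20, 30, 180, 190, 200, 210])
embeddings = np.stack([np.cos(angles), np.sin(angles)], axis=1)

adj = knn_similarity_graph(embeddings, k=2)
neighbours = diffusion_neighbourhood(adj, seed=0, steps=3, size=3)
print([vocab[i] for i in neighbours])
```

In this toy graph the seed "conspiracy" has no direct edge to "hoax" (it is not among the seed's two nearest neighbours, nor is the seed among its own), yet diffusion reaches it through the intermediate words. This is the sense in which paths of semantic association go beyond direct pairwise similarity.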