Explainable identification of similarities between entities for discovery in large text

📅 2025-03-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text similarity tools, including large language models, struggle to distinguish superficial lexical overlap from genuine similarity between the entities that the documents discuss. To address this, the authors propose a non-parametric similarity analysis framework based on weighted n-grams. The method applies a language-frequency penalty that down-weights n-grams common in general English, so that similarity scores reflect related entities rather than surface-level word repetition. All computational steps are traceable and interpretable, and results can be visualized (e.g., as word clouds) for inspection. Experiments across diverse domains, including biographies, scientific literature, and historical texts, indicate that the framework identifies entity-level similarities across documents. Results are deterministic and reproducible. An open-source implementation is publicly available.

📝 Abstract

With a virtually unlimited number of text documents available in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases they measure the similarity between documents as reflected by the wording of the text, rather than the similarity between the subjects discussed in those documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight that is higher when the n-gram is more frequent in both documents, but is penalized when the n-gram is more frequent in the English language. Visualization tools such as word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available.
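
The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formula, which is not given here: the product-over-background weight, the function names, the `floor` parameter, and the `background` frequency table are all assumptions made for illustration. The sketch only shows the stated behavior, namely that a shared n-gram scores higher when it is frequent in both documents and lower when it is frequent in general English.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in lowercased text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_freq(counts):
    """Convert raw counts to relative frequencies within one document."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def shared_ngram_weights(doc_a, doc_b, background, n=2, floor=1e-6):
    """Rank n-grams shared by two documents.

    Hypothetical weight: freq_a * freq_b / background_freq.
    `background` maps n-gram tuples to their relative frequency in general
    English; n-grams missing from it fall back to `floor` (an assumption).
    """
    freq_a = relative_freq(ngram_counts(doc_a, n))
    freq_b = relative_freq(ngram_counts(doc_b, n))
    weights = {}
    for gram in freq_a.keys() & freq_b.keys():
        penalty = max(background.get(gram, floor), floor)
        weights[gram] = freq_a[gram] * freq_b[gram] / penalty
    # Sort by weight, then by n-gram, so the ranking is deterministic.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy background table; real values would come from an English corpus.
    background = {("the", "telescope"): 1e-4, ("spiral", "galaxy"): 1e-6}
    ranked = shared_ngram_weights(
        "The telescope observed the spiral galaxy",
        "A spiral galaxy was seen by the telescope",
        background,
    )
    for gram, weight in ranked:
        print(" ".join(gram), round(weight, 2))
```

Because every weight is a plain function of three frequencies, each score can be traced back to the counts that produced it, and the ranked list could feed a word cloud in which size reflects weight.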

Problem

Research questions and friction points this paper is trying to address.

Automatically compare text documents for meaningful insights

Provide explainable textual similarities beyond surface-level comparisons

Develop n-gram framework to uncover subject-level document similarities

Innovation

Methods, ideas, or system contributions that make the work stand out.

N-gram analysis framework for document comparison

Weighted scoring formula for n-gram frequency

Visualization tools to enhance similarity insights
