🤖 AI Summary
Existing text similarity tools—including large language models—struggle to distinguish superficial lexical overlap from genuine semantic similarity among underlying entities. To address this, we propose a non-parametric similarity analysis framework based on weighted n-grams. Our method incorporates a language-frequency penalty to correct statistical biases in English corpora, ensuring similarity scores reflect semantically related entities rather than surface-level word repetition. All computational steps are fully traceable and interpretable, and results support visualization (e.g., word clouds) for empirical validation. Extensive experiments across diverse domains—including biographies, scientific literature, and historical texts—demonstrate that the framework consistently identifies deep, cross-document entity-level semantic similarity. Results are deterministic and fully reproducible. An open-source implementation is publicly available.
📝 Abstract
With a virtually unlimited number of text documents available in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases they measure similarity between documents as reflected by the wording of the text, rather than similarity between the subjects discussed in those documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight: the weight is higher when the n-gram is more frequent in both documents, and is penalized when the n-gram is more frequent in the English language. Visualization tools such as word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that the framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, and historical texts. Code for the method is publicly available.
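The abstract describes the scoring idea only in words, so the following is a minimal Python sketch of one way such a weighting could look, not the authors' formula. It assumes the weight of a shared n-gram is the product of its relative frequencies in the two documents divided by its frequency in general English. The function names (`ngram_counts`, `shared_ngram_weights`), the `background_freq` table, and the `floor` parameter are all hypothetical and introduced here only for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in a text after lowercasing and simple tokenization."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shared_ngram_weights(doc_a, doc_b, background_freq, n=2, floor=1e-9):
    """Weight each n-gram that occurs in both documents.

    Assumed formula (illustrative, not taken from the paper):
        weight = rel_freq_in_a * rel_freq_in_b / english_freq
    `background_freq` maps n-grams to their relative frequency in English;
    n-grams missing from it fall back to `floor` to avoid division by zero.
    """
    counts_a = ngram_counts(doc_a, n)
    counts_b = ngram_counts(doc_b, n)
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1

    weights = {}
    for gram in counts_a.keys() & counts_b.keys():
        rel_a = counts_a[gram] / total_a
        rel_b = counts_b[gram] / total_b
        english = max(background_freq.get(gram, 0.0), floor)
        weights[gram] = rel_a * rel_b / english

    # Sort from highest to lowest weight so the ranking is easy to inspect.
    return dict(sorted(weights.items(), key=lambda item: item[1], reverse=True))
```

Because every step is a plain count or ratio, each weight can be traced back to the occurrences that produced it, and the same inputs always yield the same ranking. The resulting dictionary of n-grams and weights is also the kind of input a word-cloud tool typically expects. With toy background frequencies, a distinctive shared bigram such as "marie curie" would rank above a common one such as "of the" even when both appear equally often in the two documents.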