Proposal and study of statistical features for string similarity computation and classification

📅 2026-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a language- and syntax-agnostic approach to string similarity measurement. It adapts co-occurrence matrices (COM) and run-length matrices (RLM), concepts originally from image texture analysis, to strings in order to build purely statistical, language-independent features. These features are compared against other statistical measures, including longest common subsequence, mutual information, and edit distance. On synthetic datasets, COM and RLM outperformed the other features, and in three out of four experiments the difference from the second-best group (distance-based features) was statistically significant (p < 0.001). On a real-world text plagiarism dataset, RLM achieved the best results, which suggests the proposed features carry over to real text.
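For reference, two of the baseline measures named above, longest common subsequence and edit distance, can be sketched with the textbook dynamic-programming formulations. This is a minimal illustration, not the paper's implementation: the function names `lcs_length` and `edit_distance` are hypothetical, and the edit distance shown is the Levenshtein variant with unit costs, which is an assumption since the summary does not say which edit distances were used.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: unit-cost insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
    print(edit_distance("kitten", "sitting"))
```

Both functions work on raw symbols only, so, like the proposed features, they need no tokenizer or language model.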
📝 Abstract
Adaptations of two features commonly applied in the field of visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. They are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are also evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
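To make the idea of carrying COM and RLM over from images to strings concrete, here is a minimal sketch. It is an assumption about how such an adaptation could look, not the paper's definition. Here the co-occurrence matrix is stored sparsely as counts of ordered symbol pairs a fixed offset apart, and the run-length matrix as counts of (symbol, run length) pairs. The function names, the `offset` parameter, and the use of cosine similarity to compare two strings are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import groupby


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Sparse co-occurrence matrix: counts of ordered pairs (s[i], s[i + offset])."""
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return Counter(zip(s, s[offset:]))


def run_length_counts(s: str) -> Counter:
    """Sparse run-length matrix: counts keyed by (symbol, length of the run)."""
    return Counter((symbol, len(list(run))) for symbol, run in groupby(s))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    x, y = "aabbbab", "aabbab"
    print(cooccurrence_counts(x))
    print(run_length_counts(x))
    print(cosine_similarity(cooccurrence_counts(x), cooccurrence_counts(y)))
    print(cosine_similarity(run_length_counts(x), run_length_counts(y)))
```

Because both structures count symbols and their arrangement only, the same code applies unchanged to words, source code, or text in any language. That is the property the abstract emphasizes.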
Problem

Research questions and friction points this paper is trying to address.

string similarity
statistical features
language-independent
text classification
plagiarism detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

co-occurrence matrix
run-length matrix
string similarity
language-independent features
statistical text analysis
E. O. Rodrigues
Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil
D. Casanova
Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil
M. Teixeira
Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil
V. Pegorini
Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil
F. Favarim
Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil
E. Clua
Department of Computer Science, Universidade Federal Fluminense (UFF), Rio de Janeiro, Brazil
A. Conci
Department of Computer Science, Universidade Federal Fluminense (UFF), Rio de Janeiro, Brazil
Panos Liatsis
Professor, Khalifa University