Mining Negative Sequential Patterns to Improve Viral Genomic Feature Representation and Classification

📅 2026-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for viral genome classification predominantly rely on compositional or frequency-based features, and often suffer from poor interpretability and limited performance on complex or imbalanced datasets. This work proposes GeneNSPCla, a framework that brings negative sequential patterns (NSPs) into RNA virus classification. It also presents GONPM+, a negative pattern mining algorithm adapted for genomic data, which mines longer and more biologically meaningful absence patterns. The NSPs are encoded as numerical features and fed to multiple supervised classifiers, so that both presence and absence signals in the sequences are modeled. Averaged across eight classifiers, GONPM+ improves accuracy by 10.03% over the original negative pattern mining algorithm and by 24.75% over the positive pattern mining algorithm.
📝 Abstract
Viruses are the most abundant biological entities on Earth and play a pivotal role in microbial ecosystems; as prominent human pathogens, they are also closely linked to human morbidity and mortality. Accurate identification of viruses from their genome sequences is therefore essential, but existing genome-based classification models, which largely rely on composition- or frequency-based subsequence features, often suffer from limited interpretability and reduced accuracy, particularly on complex or imbalanced datasets. To address these limitations, we propose GeneNSPCla (Genomic Negative Sequential Pattern-based Classification), a novel viral classification framework based on Negative Sequential Patterns (NSPs) that extracts discriminative absence-based features from the nucleotide sequences of RNA viral genomes. By transforming these NSPs into numerical feature vectors and integrating them into multiple supervised classifiers, GeneNSPCla captures both presence and absence signals in viral sequences. Furthermore, we propose GONPM+, a negative pattern mining algorithm adapted for genomic data that can discover longer and more biologically meaningful negative sequential patterns. Experimental results show that, averaged across eight classifiers, GONPM+ improves accuracy by 10.03% over the original negative pattern mining algorithm and by 24.75% over the positive pattern mining algorithm. These findings highlight the effectiveness of incorporating absence-based sequential information, providing a new and complementary perspective for viral genome analysis and classification.
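To make the idea of turning NSPs into feature vectors concrete, here is a minimal Python sketch. It is not the GeneNSPCla or GONPM+ implementation: the pattern representation, the matching semantics (a negated nucleotide must not occur in the gap before the next positive element), the binary encoding, and the names `contains_nsp` and `encode` are all assumptions made for illustration. Patterns are assumed to end with a positive element.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each element is (nucleotide, is_negative).
# [("A", False), ("C", True), ("G", False)] reads as <A, not-C, G>:
# an A, later a G, with no C anywhere between them.
Element = Tuple[str, bool]
Pattern = List[Element]


def contains_nsp(seq: str, pattern: Pattern) -> bool:
    """Return True if seq contains the pattern under the assumed semantics."""

    def search(pos: int, idx: int, forbidden: Optional[str]) -> bool:
        if idx == len(pattern):
            return True
        symbol, negative = pattern[idx]
        if negative:
            # Carry the forbidden nucleotide into the gap before the
            # next positive element.
            return search(pos, idx + 1, symbol)
        for j in range(pos, len(seq)):
            if seq[j] == symbol and search(j + 1, idx + 1, None):
                return True
            if seq[j] == forbidden:
                # The forbidden nucleotide appeared in the gap, so no
                # later match can satisfy the absence constraint.
                break
        return False

    return search(0, 0, None)


def encode(sequences: List[str], patterns: List[Pattern]) -> List[List[int]]:
    """Encode each sequence as a binary vector, one entry per pattern."""
    return [[int(contains_nsp(s, p)) for p in patterns] for s in sequences]


patterns = [
    [("A", False), ("C", True), ("G", False)],  # <A, not-C, G>
    [("G", False), ("T", True), ("A", False)],  # <G, not-T, A>
]
features = encode(["ATTG", "ACG", "GCA"], patterns)
print(features)
```

The resulting rows could then be passed to any supervised classifier as input features; the abstract describes numerical feature vectors but does not say whether they are binary, so a count- or support-based encoding is equally plausible.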
Problem

Research questions and friction points this paper is trying to address.

viral classification
genomic feature representation
negative sequential patterns
sequence-based features
imbalanced datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Negative Sequential Patterns
Viral Genome Classification
Absence-based Features
GONPM+
Genomic Feature Representation