Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

📅 2025-01-10
🤖 AI Summary
This study investigates the impact of data breadth (number of subjects) versus depth (per-subject sample size and keystroke sequence length) on Siamese network performance in keystroke dynamics authentication. Using three public datasets (Aalto, CMU, and Clarkson II), we employ triplet-loss training, feature-space density analysis, and ablation studies across multiple dimensions. We show quantitatively that increasing subject count markedly improves cross-user generalization, while the effect of per-subject data depth depends on text type: free-text authentication performance is jointly constrained by sample size and sequence length, whereas fixed-text scenarios are more robust. Crucially, expanding the number of subjects yields greater accuracy gains than increasing per-subject data volume. Our core contribution is a data configuration principle for behavioral biometric modeling, grounded in empirical evidence of the trade-off between dataset scale and authentication accuracy, that gives practitioners reproducible, deployment-oriented guidance for balancing resource constraints against system performance.

📝 Abstract
Deep learning models, such as Siamese Neural Networks (SNNs), have shown great potential in capturing the intricate patterns in behavioral data. However, the impact of dataset breadth (i.e., the number of subjects) and depth (e.g., the number of training samples per subject) on the performance of these models is often informally assumed and remains under-explored. To this end, we conducted extensive experiments on three publicly available keystroke datasets (Aalto, CMU, and Clarkson II), using the concepts of "feature space" and "density" to guide the experiments and gain a deeper understanding of the impact of dataset breadth and depth. By varying the number of training subjects, the number of samples per subject, the amount of data in each sample, and the number of triplets used in training, we found that, when feasible, increasing dataset breadth yields a well-trained model that effectively captures more inter-subject variability. In contrast, the extent of depth's impact depends on the nature of the dataset. Free-text datasets are influenced by all of the depth-wise factors: inadequate samples per subject, sequence length, training triplets, or gallery sample size may each lead to an under-trained model. Fixed-text datasets are less affected by these factors, which makes it easier to obtain a well-trained model. These findings shed light on the importance of dataset breadth and depth in training deep learning models for behavioral biometrics and provide valuable insights for designing more effective authentication systems.
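The experimental knobs named in the abstract (training subjects, samples per subject, data per sample, and training triplets) can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Dataset` layout, the function names `subsample`, `make_triplets`, and `triplet_loss`, and the margin of 1.0 are all assumptions made for illustration, and the embedding network itself is left out.

```python
import random
from typing import Dict, List, Tuple

Sample = List[float]                 # one keystroke sequence (hypothetical timing features)
Dataset = Dict[str, List[Sample]]    # subject id -> that subject's samples (assumed layout)


def subsample(data: Dataset, n_subjects: int, n_samples: int,
              seq_len: int, rng: random.Random) -> Dataset:
    """Reduce a dataset along breadth (subjects) and depth (samples, sequence length)."""
    subjects = rng.sample(sorted(data), n_subjects)              # breadth
    return {
        s: [x[:seq_len] for x in rng.sample(data[s], n_samples)]  # depth
        for s in subjects
    }


def make_triplets(data: Dataset, n_triplets: int,
                  rng: random.Random) -> List[Tuple[Sample, Sample, Sample]]:
    """Draw (anchor, positive, negative) triplets.

    Requires at least two subjects and at least two samples per subject.
    """
    subjects = sorted(data)
    triplets = []
    for _ in range(n_triplets):
        same, other = rng.sample(subjects, 2)          # two distinct subjects
        anchor, positive = rng.sample(data[same], 2)   # two distinct samples, same subject
        negative = rng.choice(data[other])             # one sample from a different subject
        triplets.append((anchor, positive, negative))
    return triplets


def triplet_loss(d_ap: float, d_an: float, margin: float = 1.0) -> float:
    """Standard triplet margin loss on precomputed embedding distances.

    d_ap: anchor-positive distance, d_an: anchor-negative distance.
    The margin value is an assumed default, not taken from the paper.
    """
    return max(0.0, d_ap - d_an + margin)
```

Sweeping `n_subjects` while holding `n_samples`, `seq_len`, and `n_triplets` fixed would correspond to a breadth experiment; sweeping any of the other three with `n_subjects` fixed would correspond to a depth experiment.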
Problem

Research questions and friction points this paper is trying to address.

Siamese Neural Network
Typing Rhythm Recognition
Data Set Characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Siamese Neural Networks
Typing Rhythm Dataset
Behavioral Biometrics
A. Wahab
Electrical and Computer Engineering, Clarkson University, Potsdam, NY, USA; Software Engineering, RIT, Rochester, NY, USA
Daqing Hou
Rochester Institute of Technology
Software Engineering, Cybersecurity, Behavioral Biometrics, Education Research, Smart Energy
Nadia Cheng
Statistics, University of Virginia, Charlottesville, VA, USA
Parker Huntley
College of Computing, Georgia Tech, Atlanta, GA, USA
Charles Devlen
Computing and Information Science, RIT, Rochester, NY, USA