🤖 AI Summary
This study addresses the critical challenge of acoustic noise degrading the reliability of emotion recognition and voice pathology detection in medical speech analysis. It presents the first systematic evaluation of the robustness of quanvolutional neural networks (QNNs) under non-adversarial acoustic distortions, benchmarking them against classical CNN architectures (CNN-Base, ResNet-18, and VGG-16) in a clean-training/noisy-testing paradigm. The experiments cover four types of perturbation and are assessed with metrics such as classification error (CE), mean CE (mCE), and convergence speed. Results show that a shallow entangled quantum front-end significantly enhances noise resilience: under pitch, temporal, and speaking-rate perturbations, the QNN reduces error rates by up to 22% relative to CNN-Base and converges up to six times faster. However, it remains comparatively sensitive to Gaussian noise.
📝 Abstract
Speech-based machine learning systems are sensitive to noise, complicating reliable deployment in emotion recognition and voice pathology detection. We evaluate the robustness of a hybrid quantum machine learning model, the quanvolutional neural network (QNN), against classical convolutional neural networks (CNNs) under four acoustic corruptions (Gaussian noise, pitch shift, temporal shift, and speed variation) in a clean-train/corrupted-test regime. Using AVFAD (voice pathology) and TESS (speech emotion), we compare three QNN variants (Random, Basic, Strongly) to a simple CNN baseline (CNN-Base), ResNet-18, and VGG-16 using accuracy and corruption metrics (CE, mCE, RCE, RmCE), and analyze architectural factors (circuit complexity and depth, convergence) alongside per-emotion robustness. QNNs generally outperform CNN-Base under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift), while CNN-Base remains more resilient to Gaussian noise. Among quantum circuits, QNN-Basic achieves the best overall robustness on AVFAD, and QNN-Random performs strongest on TESS. Emotion-wise, fear is most robust (80-90% accuracy under severe corruptions), neutral can collapse under strong Gaussian noise (5.5% accuracy), and happy is most vulnerable to pitch, temporal, and speed distortions. QNNs also converge up to six times faster than CNN-Base. To our knowledge, this is the first systematic study of QNN robustness for speech under common non-adversarial acoustic corruptions, indicating that shallow entangling quantum front-ends can improve noise resilience while sensitivity to additive noise remains a challenge.
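The abstract reports CE, mCE, and RCE without defining them; the sketch below illustrates the common corruption-benchmark convention (per-corruption error normalized by a baseline model's error, with the relative variant subtracting clean error first). The per-severity error rates and the five-severity setup are hypothetical placeholders, not numbers from the paper.

```python
# Hedged sketch of corruption-robustness metrics under the usual convention:
# CE normalizes a model's summed error over severities by a baseline model's;
# RCE does the same after subtracting each model's clean-test error;
# mCE / RmCE average these over all corruptions studied.

def corruption_error(model_err, baseline_err):
    """CE for one corruption: model error summed over severity
    levels, normalized by the baseline's summed error."""
    return sum(model_err) / sum(baseline_err)

def relative_corruption_error(model_err, model_clean,
                              baseline_err, baseline_clean):
    """RCE for one corruption: degradation beyond each model's
    own clean error, normalized by the baseline's degradation."""
    num = sum(e - model_clean for e in model_err)
    den = sum(e - baseline_clean for e in baseline_err)
    return num / den

# Hypothetical per-severity error rates (5 severities) for one corruption.
qnn_err      = [0.10, 0.14, 0.18, 0.25, 0.30]
cnn_base_err = [0.12, 0.18, 0.24, 0.33, 0.40]

ce = corruption_error(qnn_err, cnn_base_err)
rce = relative_corruption_error(qnn_err, 0.08, cnn_base_err, 0.09)
print(round(ce, 3), round(rce, 3))  # CE < 1 means more robust than baseline
```

Averaging `ce` over the four corruptions (Gaussian noise, pitch shift, temporal shift, speed variation) would give the mCE figure the abstract refers to.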