🤖 AI Summary
This study examines whether long-range sequence architectures such as Transformer and Mamba outperform conventional CNN or CNN-LSTM models for affect recognition from wrist-worn photoplethysmography (PPG), particularly under real-world conditions of limited data and high noise. Transformers and Mamba are applied to PPG-based emotion recognition for the first time. A unified preprocessing pipeline, segmentation strategy, and subject-independent 5-fold cross-validation protocol are used to systematically evaluate four model families on classifying arousal, valence, and relaxation states. CNNs achieve the highest accuracy while remaining the most lightweight. Transformers yield more balanced F1 scores for arousal and relaxation. Mamba performs comparably to Transformers but does not consistently surpass CNNs. These findings offer empirically grounded guidance for model selection in wearable affective computing systems.
📝 Abstract
Photoplethysmography (PPG) is increasingly used in wearable affective computing because of its low cost and ease of integration into consumer devices. Recent advances in deep learning have introduced long-range sequence models such as Transformers and state-space models such as Mamba, which perform strongly on natural language and general time-series tasks. However, it remains unclear whether these architectures offer tangible benefits over widely used Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for PPG-based affect recognition, where datasets are typically small and noisy. This work presents a measurement-driven comparison of four deep learning architectures (CNN, a CNN-LSTM hybrid, Transformer, and Mamba) for classifying arousal, valence, and relaxation states from wrist-based PPG signals. All models are evaluated under a subject-independent 5-fold cross-validation protocol with identical preprocessing, segmentation, and training pipelines. Our results show that the Transformer and Mamba models achieve performance comparable to that of a CNN baseline but do not consistently outperform it across all tasks. CNNs remain the most effective overall, providing the highest accuracy with the smallest model size, whereas Transformers achieve better-balanced F1 scores for arousal and relaxation. This study provides the first evaluation of Transformer and Mamba models for PPG-based affect recognition and offers practical guidance on model selection for wearable affective monitoring systems.
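The key property of a subject-independent protocol is that all PPG segments from a given participant land in the same fold, so a model is never tested on a person it saw during training. The abstract does not say how subjects are assigned to folds. The minimal sketch below therefore assumes a simple round-robin assignment over sorted subject IDs, and the function name `subject_independent_folds` is hypothetical; it is an illustration of the protocol, not the authors' code.

```python
from collections import defaultdict


def subject_independent_folds(subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no subject appears in both sets.

    subject_ids[i] is the participant who produced segment i.
    Hypothetical helper: subjects are dealt to folds round-robin (an assumption).
    """
    # Group segment indices by the subject they came from.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    subjects = sorted(by_subject)
    if n_folds > len(subjects):
        raise ValueError("need at least as many subjects as folds")

    # Assign whole subjects (never individual segments) to folds.
    fold_subjects = [subjects[f::n_folds] for f in range(n_folds)]

    for test_subjects in fold_subjects:
        held_out = set(test_subjects)
        test_idx = [i for s in test_subjects for i in by_subject[s]]
        train_idx = [i for s in subjects if s not in held_out for i in by_subject[s]]
        yield sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Toy example: 10 subjects, 3 segments each.
    ids = [f"S{s:02d}" for s in range(10) for _ in range(3)]
    for k, (tr, te) in enumerate(subject_independent_folds(ids)):
        print(k, len(tr), len(te))
```

Splitting by subject rather than by segment matters because consecutive windows from one recording are highly correlated; a segment-level split would leak subject identity into the test set and inflate accuracy.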