🤖 AI Summary
This study investigates whether pretrained vision models can accurately predict human fear responses to spider images, a capability needed for emotion-adaptive computerized exposure therapy. We evaluated ResNet, ViT, and ConvNeXt using transfer learning and five-fold cross-validation to regress fear scores (0–100) on a dataset of 313 annotated spider images. To enhance interpretability, we incorporated attention visualization and category-wise error analysis, which revealed that the models attend to semantically meaningful features such as spider morphology and posture. The best-performing model achieved a mean absolute error of 10.1, and learning curves indicate that performance saturates at the current data scale. Our key contributions are threefold: (1) the first systematic validation of general-purpose vision models for fine-grained fear response regression; (2) an interpretability-driven framework for emotion-aware image assessment; and (3) empirical evidence that sufficient dataset scale and interpretable features are jointly necessary for clinically reliable AI.
📝 Abstract
Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy, where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three architecturally diverse models via transfer learning to predict human fear ratings (on a 0–100 scale) from a standardized dataset of 313 images. Evaluated with five-fold cross-validation, the models achieved average mean absolute errors (MAE) between 10.1 and 11.0. Learning curve analysis revealed that shrinking the training set substantially degraded performance, whereas performance plateaued as the training set approached its full size. Explainability assessments showed that the models' predictions were driven by spider-related features, and a category-wise error analysis identified visual conditions associated with higher errors (e.g., distant views and artificial or painted spiders). These findings demonstrate the potential of explainable computer vision models for predicting fear ratings and highlight the importance of both model explainability and sufficient dataset size for developing emotion-aware therapeutic technologies.
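The evaluation protocol described above (five-fold cross-validation of a regression head on 313 images, scored by MAE on the 0–100 fear scale) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes backbone embeddings (e.g., penultimate-layer features from ResNet, ViT, or ConvNeXt) have already been extracted, and substitutes random 512-dimensional vectors with a synthetic fear signal in their place, with a ridge regressor standing in for the fine-tuned head.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Stand-ins for the 313 annotated images: synthetic 512-d "embeddings"
# with a linear fear signal plus noise, clipped to the 0-100 rating scale.
X = rng.normal(size=(313, 512))
y = np.clip(X[:, :5].sum(axis=1) * 8 + 50 + rng.normal(scale=5, size=313), 0, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, test_idx in kf.split(X):
    head = Ridge(alpha=1.0)                  # regression head on fixed features
    head.fit(X[train_idx], y[train_idx])
    pred = np.clip(head.predict(X[test_idx]), 0, 100)  # keep predictions on-scale
    fold_mae.append(mean_absolute_error(y[test_idx], pred))

print(f"mean MAE over 5 folds: {np.mean(fold_mae):.1f}")
```

Averaging MAE over the five held-out folds mirrors how the 10.1–11.0 figures above are reported; in the actual study the ridge head would be replaced by end-to-end fine-tuning of each pretrained backbone.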