🤖 AI Summary
Micro-expression analysis faces two key challenges: (1) sliding-window classification with a fixed window length cannot accommodate the transient nature and variable duration of micro-expressions; and (2) spotting and recognition are conventionally modeled as separate tasks, which neglects their intrinsic coupling. To address these, we propose ME-TST+, the first video-level end-to-end regression framework that jointly models spotting and recognition via a temporal state transition mechanism. Methodologically, it combines three components: dynamic temporal modeling based on state-space models, multi-granularity region-of-interest (ROI) feature extraction, and a dual-path (slow-fast) Mamba architecture, with spotting and recognition optimized collaboratively at both the feature level and the decision level. Evaluated on benchmark datasets, including CASME III and SAMM, ME-TST+ achieves state-of-the-art performance, improving spotting precision, recognition accuracy, and robustness across micro-expressions of different durations.
📝 Abstract
Micro-expressions (MEs) are regarded as important indicators of an individual's intrinsic emotions, preferences, and tendencies. ME analysis requires spotting ME intervals within long video sequences and recognizing their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two architectures based on state space models, namely ME-TST and ME-TST+, which use temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate the information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The code is available at https://github.com/zizheng-guo/ME-TST.
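To illustrate why video-level regression can accommodate MEs of varying durations where a fixed window cannot, the following is a minimal sketch, not the authors' implementation. It assumes a model has already produced one score per frame for the whole video; the function name `intervals_from_scores`, the threshold of 0.5, and the minimum length of 2 frames are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def intervals_from_scores(
    scores: List[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    min_len: int = 2,        # assumed minimum interval length in frames
) -> List[Tuple[int, int]]:
    """Return inclusive (onset, offset) frame indices of runs scoring >= threshold."""
    intervals: List[Tuple[int, int]] = []
    start = None  # index where the current run began, or None if not in a run
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a run begins at this frame
        elif score < threshold and start is not None:
            if i - start >= min_len:  # keep only runs that are long enough
                intervals.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame of the video.
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores) - 1))
    return intervals


# Hypothetical per-frame scores for a 13-frame clip.
scores = [0.0, 0.1, 0.6, 0.9, 0.7, 0.2, 0.0, 0.8, 0.1, 0.6, 0.7, 0.9, 0.8]
print(intervals_from_scores(scores))  # [(2, 4), (9, 12)]
```

The two detected intervals have different lengths (3 and 4 frames), and the isolated single-frame spike at index 7 is discarded. Interval length here follows from the score curve rather than from a preset window size, which is the property the abstract attributes to video-level regression.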