EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational complexity of existing Transformer-based micro-expression recognition methods and their difficulty in learning effective representations from small-scale datasets. To overcome these limitations, the authors propose EPIR, an efficient micro-expression recognition framework built on several key innovations: dual norm shifted patch tokenization (DNSPT), a token integration mechanism across cascaded Transformer blocks, an enhanced attention mechanism, and a dynamic token selection module (DTSM). These components collectively reduce computational overhead while strengthening discriminative representation learning. Extensive experiments show that EPIR consistently outperforms state-of-the-art methods across four benchmark datasets (CASME II, SAMM, SMIC, and CAS(ME)³), with improvements of up to 9.6% in UF1 and 4.58% in UAR.
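The summary names dual norm shifted patch tokenization (DNSPT) as the entry point of the pipeline. The sketch below is a hypothetical PyTorch illustration of that general idea, assuming the standard shifted-patch-tokenization recipe (concatenating diagonally shifted copies of the input before patchification) with a normalization layer before and after the linear projection. The class name, dimensions, shift pattern, and use of `torch.roll` are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of shifted patch tokenization with a dual-normalized
# projection, loosely following the DNSPT description; the exact shift
# pattern, norm placement, and dimensions in the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualNormShiftedPatchTokenizer(nn.Module):
    def __init__(self, in_chans=3, embed_dim=192, patch_size=8):
        super().__init__()
        # The original frame plus 4 diagonally shifted copies are concatenated
        # along the channel axis before patch embedding.
        concat_chans = in_chans * 5
        self.patch_size = patch_size
        self.pre_norm = nn.LayerNorm(concat_chans * patch_size * patch_size)
        self.proj = nn.Linear(concat_chans * patch_size * patch_size, embed_dim)
        self.post_norm = nn.LayerNorm(embed_dim)  # second norm of the "dual norm" projection

    def forward(self, x):                         # x: (B, C, H, W)
        s = self.patch_size // 2
        # Diagonal shifts expose local spatial relations between neighboring pixels.
        # A real implementation might pad and crop instead of rolling (roll wraps around).
        shifts = [(s, s), (s, -s), (-s, s), (-s, -s)]
        shifted = [torch.roll(x, shift, dims=(2, 3)) for shift in shifts]
        x = torch.cat([x] + shifted, dim=1)       # (B, 5C, H, W)
        # Non-overlapping patches flattened into tokens.
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size)
        patches = patches.transpose(1, 2)         # (B, N, 5C * p * p)
        tokens = self.proj(self.pre_norm(patches))
        return self.post_norm(tokens)             # (B, N, embed_dim)
```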
📝 Abstract
Micro-expression recognition can reveal an individual's genuine emotion at a given moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, they suffer from high computational complexity due to the large number of tokens processed by multi-head self-attention. In addition, existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which balances high recognition performance with low computational complexity. Specifically, we first propose a dual norm shifted patch tokenization (DNSPT) module to learn the spatial relationships between neighboring pixels in the face region, implemented by a refined spatial transformation and a dual-norm projection. Then, we propose a token integration module to merge partial tokens across multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce unnecessary attention paid by tokens to themselves, and then uses a dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)³). The experimental results show that our method achieves significant performance gains over state-of-the-art methods, such as a 9.6% improvement in UF1 on the CAS(ME)³ dataset and a 4.58% improvement in UAR on the SMIC dataset.
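The abstract describes a dynamic token selection module (DTSM) that keeps only the most discriminative tokens. Below is a minimal, hypothetical sketch of that kind of top-k token pruning; the MLP scorer, the keep ratio, and the decision to always retain the class token are assumptions made for illustration and are not taken from the paper.

```python
# A minimal, hypothetical sketch of dynamic token selection: tokens are scored
# (here by a small MLP, purely illustrative) and only the top-k most salient
# ones are passed to later blocks; the paper's DTSM scoring rule may differ.
import torch
import torch.nn as nn


class DynamicTokenSelector(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):                    # tokens: (B, N, D), class token at index 0
        cls_tok, patch_toks = tokens[:, :1], tokens[:, 1:]
        scores = self.scorer(patch_toks).squeeze(-1)          # (B, N-1) saliency scores
        k = max(1, int(patch_toks.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                   # indices of the most discriminative tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_toks.size(-1))
        kept = patch_toks.gather(1, idx)                      # (B, k, D)
        return torch.cat([cls_tok, kept], dim=1)              # class token always retained
```

Selecting a fixed fraction of tokens per stage keeps the attention cost predictable while letting the scores adapt to each input, which matches the stated goal of trading token count for efficiency without discarding discriminative content.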
Problem

Research questions and friction points this paper is trying to address.

micro-expression recognition
computational complexity
small-scale datasets
token redundancy
Transformer-based methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

micro-expression recognition
efficient tokenization
Transformer optimization
token integration
dynamic token selection
Junbo Wang
School of Software, Northwestern Polytechnical University, Xi’an, 710129, China
Liangyu Fu
School of Software, Northwestern Polytechnical University, Xi’an, 710129, China
Yuke Li
School of Software, Northwestern Polytechnical University, Xi’an, 710129, China
Yining Zhu
School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129, China
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Kun Hu
Lecturer at Edith Cowan University
Multimedia Computing, Graphics, Artificial Intelligence