MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

πŸ“… 2026-04-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the time-consuming and error-prone manual annotation that limits rodent social behavior recognition by proposing a lightweight multi-scale global–local Transformer. The model explicitly captures behavioral dynamics across multiple temporal scales through parallel short-range, mid-range, and global attention branches, and incorporates a Behavior-Aware Modulation (BAM) module to enhance discriminative feature representation. Presented as the first approach to achieve cross-dataset generalization within a unified architecture without task-specific fine-tuning, the model attains 75.4% accuracy (F1 = 0.745) on RatSI and 87.1% accuracy (F1 = 0.8745) on CalMS21, outperforming established methods such as TCN, LSTM, and ST-GCN.
πŸ“ Abstract
Recognition of rodent behavior is important for understanding neural and behavioral mechanisms, but traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight Transformer encoder whose parallel short-range, medium-range, and global attention branches explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and an F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and an F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only the input dimensionality and number of classes adjusted.
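The abstract describes BAM as an SE-Network-style block that modulates temporal embeddings before attention. The paper does not give the exact formulation, so the following is a minimal NumPy sketch under assumed details (a mean squeeze over time, a bottleneck MLP of reduction ratio 4, and sigmoid feature gates); the function and weight names are hypothetical, not from the paper.

```python
import numpy as np

def bam_modulate(x, w1, b1, w2, b2):
    """SE-style Behavior-Aware Modulation (sketch).
    x: (T, D) temporal pose embeddings.
    Squeeze over time, excite through a small bottleneck MLP,
    then rescale each feature dimension with a sigmoid gate."""
    s = x.mean(axis=0)                        # squeeze: (D,)
    h = np.maximum(0.0, s @ w1 + b1)          # bottleneck + ReLU: (R,)
    g = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid gates: (D,)
    return x * g                              # per-feature modulation

rng = np.random.default_rng(0)
T, D, R = 16, 12, 4                           # 12-D matches RatSI's pose inputs
x = rng.normal(size=(T, D))
w1, b1 = 0.1 * rng.normal(size=(D, R)), np.zeros(R)
w2, b2 = 0.1 * rng.normal(size=(R, D)), np.zeros(D)
y = bam_modulate(x, w1, b1, w2, b2)
```

Because the gates lie in (0, 1), the block can only attenuate features, never amplify them, which is the standard SE behavior the abstract alludes to.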
Problem

Research questions and friction points this paper is trying to address.

- rodent social behavior recognition
- pose-based temporal sequences
- behavior dynamics
- multi-scale modeling
- automated behavior analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Multi-Scale Attention
- Global-Local Transformer
- Behavior-Aware Modulation
- Pose-Based Behavior Recognition
- Temporal Dynamics Modeling
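The multi-scale global–local design amounts to running attention branches with different receptive fields in parallel. One simple way to realize this is with banded attention masks; the sketch below illustrates the idea, with the specific window sizes (±1 and ±3 frames) chosen for illustration rather than taken from the paper.

```python
import numpy as np

def band_mask(T, window):
    """Boolean attention mask over T frames.
    Frame t may attend to frames within +/- window;
    window=None yields the full (global) mask."""
    if window is None:
        return np.ones((T, T), dtype=bool)
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

T = 10
short = band_mask(T, 1)     # short-range branch: frame-to-frame transitions
mid   = band_mask(T, 3)     # mid-range branch: bout-level dynamics
glob  = band_mask(T, None)  # global branch: whole-sequence context
```

Each branch would apply its mask inside standard scaled dot-product attention, and the branch outputs are then fused so local transitions and global context are modeled jointly.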
Muhammad Imran Sharif
Department of Computer Science, Kansas State University, Manhattan, KS, 66506, USA
Doina Caragea
Kansas State University
deep learning Β· text mining Β· data mining Β· data science