🤖 AI Summary
In CSI-based multi-person pose estimation, severe occlusion, inaccurate localization of highly mobile joints (e.g., wrists and elbows), and difficulty in modeling time-frequency features pose significant challenges. To address these, this paper proposes a non-intrusive multi-person pose estimation method based on a novel time-frequency dual-token Transformer. Its core contributions are: (1) the first time-frequency dual-token Transformer, jointly modeling the temporal dynamics and spectral structure of CSI signals; and (2) a multi-stage feature fusion network (MSFN) that deeply integrates CSI features with pose heatmaps while explicitly embedding anatomical constraints. Evaluated on the MM-Fi benchmark and a custom-built dataset, the method substantially outperforms state-of-the-art approaches—achieving a 12.7% average precision gain for highly mobile joints—demonstrating superior robustness and generalization capability.
📝 Abstract
Human pose estimation based on Channel State Information (CSI) has emerged as a promising approach for non-intrusive and precise human activity monitoring, yet it still faces challenges, including accurate multi-person pose recognition and effective CSI feature learning. This paper presents MultiFormer, a wireless sensing system that accurately estimates human pose from CSI. The proposed system adopts a Transformer-based time-frequency dual-token feature extractor with multi-head self-attention, which models both the inter-subcarrier correlations and the temporal dependencies of the CSI. The extracted CSI features and the pose probability heatmaps are then fused by a Multi-Stage Feature Fusion Network (MSFN) to enforce anatomical constraints. Extensive experiments conducted on the public MM-Fi dataset and our self-collected dataset show that MultiFormer achieves higher accuracy than state-of-the-art approaches, especially for high-mobility keypoints (wrists, elbows) that previous methods find particularly difficult to estimate accurately.
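The time-frequency dual-token idea can be illustrated with a minimal, dependency-free sketch. It is not the MultiFormer implementation. It assumes a CSI amplitude matrix of shape packets x subcarriers, treats each row as a time token and each column as a frequency token, and applies single-head self-attention with identity query/key/value projections in place of the learned multi-head attention the abstract describes. The function names `dual_tokens` and `self_attention` and the toy data are hypothetical.

```python
import math


def dual_tokens(csi):
    """Split a T x S CSI amplitude matrix into time and frequency tokens.

    Assumption: one time token per packet (length S) and one frequency
    token per subcarrier (length T). The paper's actual tokenization may differ.
    """
    time_tokens = [list(row) for row in csi]        # T tokens of length S
    freq_tokens = [list(col) for col in zip(*csi)]  # S tokens of length T
    return time_tokens, freq_tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification: Q, K and V are the tokens themselves (identity
    projections); a real Transformer layer learns these projections
    and uses several heads.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# Toy CSI: 3 packets x 4 subcarriers (made-up values, not measured data).
csi = [[0.1 * (t + 1) * (s + 1) for s in range(4)] for t in range(3)]

time_tokens, freq_tokens = dual_tokens(csi)
time_features = self_attention(time_tokens)  # attends across packets (temporal dependencies)
freq_features = self_attention(freq_tokens)  # attends across subcarriers (inter-subcarrier correlations)
```

In this sketch the two token streams are processed independently. How MultiFormer combines them, and how the result is fused with pose heatmaps in the MSFN, is not specified in the abstract and is left out here.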