🤖 AI Summary
Existing WiFi-based human pose estimation methods rely on camera-coordinate supervision, making them prone to overfitting to specific device layouts and limiting their generalization. This work proposes PerceptAlign, a framework that aligns the WiFi and visual spaces through a lightweight, geometry-aware coordinate unification procedure requiring only two checkerboards and a few photographs. The calibrated transceiver positions are encoded into high-dimensional geometric embeddings and fused with channel state information (CSI) features to enable layout-agnostic 3D pose estimation. PerceptAlign introduces, for the first time, a geometry-conditioned learning mechanism that effectively disentangles human motion from device layout. Evaluated on the largest cross-domain WiFi pose dataset to date, the method reduces in-domain error by 12.3% and cross-domain error by more than 60%.
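As a concrete illustration of the coordinate unification idea: given matched 3D reference points (e.g. checkerboard corners) observed in both the WiFi-device frame and the camera frame, a rigid transform aligning the two spaces can be estimated with the standard Kabsch/Procrustes algorithm. This is a generic sketch of such an alignment, not the paper's exact procedure, which is not detailed here:

```python
import numpy as np

def align_frames(src_pts, dst_pts):
    """Estimate the rigid transform (R, t) that maps src_pts onto dst_pts
    in the least-squares sense (Kabsch algorithm).

    src_pts, dst_pts: (N, 3) arrays of corresponding 3D points, e.g.
    checkerboard corners expressed in the device frame and camera frame.
    """
    src_c = src_pts.mean(axis=0)          # centroids of each point set
    dst_c = dst_pts.mean(axis=0)
    H = (src_pts - src_c).T @ (dst_pts - dst_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # optimal rotation
    t = dst_c - R @ src_c                         # optimal translation
    return R, t
```

Once `R` and `t` are known, every transceiver position and every visual 3D pose can be expressed in the same shared coordinate system before training.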
📝 Abstract
WiFi-based 3D human pose estimation offers a low-cost and privacy-preserving alternative to vision-based systems for smart interaction. However, existing approaches rely on visual 3D poses as supervision and directly regress CSI to poses in a camera-based coordinate system. We find that this practice leads to coordinate overfitting: models memorize deployment-specific WiFi transceiver layouts rather than learning only activity-relevant representations, resulting in severe generalization failures. To address this challenge, we present PerceptAlign, the first geometry-conditioned framework for WiFi-based cross-layout pose estimation. PerceptAlign introduces a lightweight coordinate unification procedure that aligns WiFi and vision measurements in a shared 3D space using only two checkerboards and a few photographs. Within this unified space, it encodes calibrated transceiver positions into high-dimensional embeddings and fuses them with CSI features, making the model explicitly aware of device geometry as a conditional variable. This design forces the network to disentangle human motion from deployment layouts, enabling robust and, for the first time, layout-invariant WiFi pose estimation. To support systematic evaluation, we construct the largest cross-domain 3D WiFi pose estimation dataset to date, comprising 21 subjects, 5 scenes, 18 actions, and 7 device layouts. Experiments show that PerceptAlign reduces in-domain error by 12.3% and cross-domain error by more than 60% compared to state-of-the-art baselines. These results establish geometry-conditioned learning as a viable path toward scalable and practical WiFi sensing.
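The geometry-conditioning mechanism described above can be sketched as follows. This is a hypothetical design, assuming a sinusoidal positional encoding for each transceiver's calibrated 3D position and simple concatenation as the fusion step; the paper's actual encoder dimensions and fusion operator are not specified here:

```python
import numpy as np

def geometry_embedding(xyz, dim=32, max_freq=8.0):
    """Map a calibrated 3D transceiver position to a high-dimensional
    sinusoidal embedding (hypothetical encoder, not the paper's exact one).

    xyz: (3,) position in the unified coordinate frame.
    Returns a vector of roughly `dim` features (sin/cos per axis per band).
    """
    n_bands = dim // (2 * 3)                       # frequency bands per axis
    freqs = np.geomspace(1.0, max_freq, n_bands)
    angles = xyz[:, None] * freqs[None, :]         # (3, n_bands)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(-1)

def condition_on_layout(csi_feat, tx_xyz, rx_xyz):
    """Fuse extracted CSI features with embeddings of the transmitter and
    receiver positions, so the pose regressor sees device geometry as an
    explicit conditional input (concatenation is one possible fusion)."""
    return np.concatenate([csi_feat,
                           geometry_embedding(tx_xyz),
                           geometry_embedding(rx_xyz)])
```

Because the layout enters the network as an explicit input rather than being absorbed into the weights, moving the transceivers changes the conditioning vector instead of invalidating the learned CSI-to-pose mapping.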