🤖 AI Summary
Existing skeleton-based gait recognition methods suffer degraded performance under appearance variations, such as carrying objects or wearing coats, because they do not explicitly model motion dynamics. To address this, this work proposes a plug-and-play Wavelet Feature Stream that, for the first time, incorporates the continuous wavelet transform (CWT) into skeleton sequence modeling. The approach converts joint velocities into multi-scale time-frequency representations and employs a lightweight multi-scale CNN to extract dynamic cues, which are then fused with features from the backbone network. Notably, the method requires neither architectural changes to the backbone nor additional supervision, yet significantly improves robustness under covariate shift. When attached to strong backbones such as GaitMixer, it achieves state-of-the-art performance on the CASIA-B dataset, with particularly pronounced gains under challenging conditions such as BG (carrying a bag) and CL (wearing a coat).
📝 Abstract
Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.
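The core transformation described above, turning a per-joint velocity sequence into a multi-scale time-frequency scalogram, can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the Morlet wavelet, the scale range, and the function names (`morlet`, `cwt_scalogram`) are illustrative assumptions, and a real pipeline would stack the scalograms of all joints and coordinates before feeding the multi-scale CNN.

```python
import numpy as np

def morlet(t, w=5.0):
    # Complex Morlet wavelet (illustrative choice; the exact mother
    # wavelet used by the method is not specified in the abstract).
    return np.exp(1j * w * t) * np.exp(-t**2 / 2) * np.pi**-0.25

def cwt_scalogram(signal, scales):
    # CWT magnitude via direct convolution at each scale:
    # rows = scales (frequency bands), columns = time steps.
    T = len(signal)
    out = np.empty((len(scales), T))
    for i, s in enumerate(scales):
        # Sample the wavelet on a support proportional to the scale,
        # capped at the signal length so the output stays (len(scales), T).
        M = min(10 * int(s), T)
        t = (np.arange(M) - (M - 1) / 2) / s
        psi = morlet(t) / np.sqrt(s)
        coef = np.convolve(signal, np.conj(psi)[::-1], mode="same")
        out[i] = np.abs(coef)
    return out

# Toy example: one joint coordinate over 128 frames.
rng = np.random.default_rng(0)
pos = np.cumsum(rng.standard_normal(128))   # joint trajectory
vel = np.diff(pos, prepend=pos[0])          # per-frame joint velocity
scales = np.geomspace(1, 16, num=8)         # 8 dyadically spaced scales
S = cwt_scalogram(vel, scales)
print(S.shape)  # (8, 128): an image-like input for a small CNN
```

In practice one would use a tested CWT routine such as `pywt.cwt` from PyWavelets rather than this hand-rolled version; the point is only that the resulting scalogram is a fixed-size 2-D map per joint, which is what makes a lightweight CNN a natural choice for the dynamic-feature extractor.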