CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

📅 2026-05-09

🤖 AI Summary
This work addresses the performance bottleneck in isolated sign language recognition from magnitude-only 60 GHz radar Range-Time Maps, where logarithmic (decibel) compression of the signal obscures detail. The authors propose CAST, a dual-stream architecture that applies an explicit decibel-to-linear inversion to undo the log compression before spectral analysis. The framework further incorporates a cross-antenna spatial attention module and an asymmetric cross-attention fusion strategy. Combining Cadence Velocity Diagrams extracted with a windowed FFT and two pretrained backbones, ConvNeXt-Tiny and EfficientNetV2-S, the system reaches a Top-1 accuracy of 80.5% under radar-only constraints in five-fold cross-validation, 3.3 percentage points above the best single-model baseline (77.2%).
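The decibel-to-linear inversion followed by a windowed FFT can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `20*log10` amplitude convention, the Hann window, the per-range-bin mean removal, and the `(range_bins, time_frames)` layout are all assumptions made for illustration.

```python
import numpy as np

def db_to_linear(rtm_db):
    """Undo log compression, assuming amplitudes were stored as 20*log10(|x|)."""
    return 10.0 ** (np.asarray(rtm_db, dtype=float) / 20.0)

def cadence_velocity_diagram(rtm_db):
    """Sketch of a CVD-style map from a dB-scaled Range-Time Map.

    rtm_db: array of shape (range_bins, time_frames), magnitude in dB.
    Returns an array of shape (range_bins, time_frames // 2 + 1) holding
    the magnitude spectrum along the time axis for each range bin.
    """
    linear = db_to_linear(rtm_db)
    # Remove the static (zero-cadence) component of each range bin (assumed step).
    linear = linear - linear.mean(axis=1, keepdims=True)
    # Taper along time to limit spectral leakage (Hann window is an assumption).
    window = np.hanning(linear.shape[1])
    spectrum = np.fft.rfft(linear * window, axis=1)
    return np.abs(spectrum)

# Synthetic example: one range bin whose amplitude oscillates 5 times in 64 frames.
frames = np.arange(64)
amplitude = np.ones((4, 64))
amplitude[2] += 0.5 * np.sin(2 * np.pi * 5 * frames / 64)
rtm_db = 20.0 * np.log10(amplitude)   # simulate log-compressed input
cvd = cadence_velocity_diagram(rtm_db)
peak_bin = int(np.argmax(cvd[2]))     # cadence bin with the strongest response
```

Running the FFT directly on `rtm_db` instead of the linearized signal would pass the sinusoid through a logarithm first, which is the kind of nonlinearity that introduces extra harmonics in the spectrum.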
📝 Abstract
We propose CAST, a dual-stream architecture that uses channel-aware spatial transfer learning for isolated sign language recognition, addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware components with pretrained vision backbones and operates under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments show that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, a 3.3 percentage-point improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations are a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
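The asymmetric cross-attention fusion can be pictured as one-directional attention: tokens from one stream form the queries, tokens from the other stream supply keys and values, and no attention flows the other way. The sketch below is a single-head NumPy illustration under assumptions the abstract does not confirm: which stream acts as the query, the residual connection, the toy dimensions, and all names (`asymmetric_cross_attention`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """One-directional cross-attention (single head, no learned output projection).

    query_tokens:   (n_q, d_q) features from the stream that asks the questions.
    context_tokens: (n_c, d_c) features from the stream that is attended to.
    w_q: (d_q, d), w_k: (d_c, d), w_v: (d_c, d_q) projection matrices.
    Returns the fused tokens (n_q, d_q) and the attention map (n_q, n_c).
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attention = softmax(scores, axis=-1)      # each query row sums to 1
    fused = query_tokens + attention @ v      # residual connection (assumed)
    return fused, attention

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
rtm_tokens = rng.normal(size=(6, 16))   # stand-in for RTM-stream features
cvd_tokens = rng.normal(size=(4, 12))   # stand-in for CVD-stream features
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(12, 8))
w_v = rng.normal(size=(12, 16))
fused, attention = asymmetric_cross_attention(rtm_tokens, cvd_tokens, w_q, w_k, w_v)
```

Because the two backbones produce features of different widths, the projections `w_k` and `w_v` also serve to map the context stream into the query stream's space; a symmetric design would need a second attention pass in the opposite direction.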
Problem

Research questions and friction points this paper is trying to address.

sign language recognition
60 GHz radar
Range-Time Maps
magnitude-only sensing
radar-only modality
Innovation

Methods, ideas, or system contributions that make the work stand out.

channel-aware spatial transfer learning
pseudo-image radar
physics-aware signal representation
asymmetric cross-attention
Cadence Velocity Diagram
Md. Shakhoyat Rahman Shujon
Department of Computer Science and Engineering, Khulna University of Engineering & Technology
Sheikh Md. Galib Mahim
Department of Computer Science and Engineering, Khulna University of Engineering & Technology
Md. Milon Islam
University of Waterloo
Multimodal Machine Learning, AI for Health, Large Language Models
Md Rezwanul Haque
Department of Electrical and Computer Engineering, University of Waterloo
Md Rabiul Islam
Associate Professor, Electrical Engineering, University of Wollongong, NSW 2522, Australia
Power Converters, Renewable Energy, Smart Power Grids, Electrical Vehicles, Electrical Machines
Hamdi Altaheri
PhD, Post Doctoral Scholar at University of Waterloo
Deep Learning, Foundation Models, Self-Supervised Learning
Fakhri Karray
Department of Electrical and Computer Engineering, University of Waterloo; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence