🤖 AI Summary
This work addresses the performance bottleneck in isolated sign language recognition using only 60 GHz radar amplitude information, which suffers from log compression of the signal and loss of spatial detail. To overcome this limitation, the authors propose CAST, a dual-stream architecture that applies a decibel-to-linear inverse transformation to recover the original signal’s dynamic range. The framework further incorporates a cross-antenna spatial attention mechanism and an asymmetric cross-attention fusion strategy to enhance radar-based gesture representations. Combining Cadence Velocity Diagrams (CVD) extracted with a windowed FFT and two pretrained backbones (ConvNeXt-Tiny and EfficientNetV2-S), the system achieves a Top-1 accuracy of 80.5% under strict radar-only constraints, as measured by five-fold cross-validation, outperforming the best single-model baseline (77.2%) by 3.3 percentage points.
📝 Abstract
We propose CAST, a dual-stream architecture that uses channel-aware spatial transfer learning for isolated sign language recognition, addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware components with pretrained vision backbones and operates under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to the raw antenna channels before convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Experiments show that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, a 3.3 percentage-point improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations are a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
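The first component (decibel-to-linear inversion followed by a windowed FFT) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `db_to_linear` and `cadence_velocity_diagram`, the `(range_bins, time_steps)` layout, the 20·log10 amplitude convention, the Hann window, and taking the FFT along the time axis of each range bin are all assumptions made for illustration.

```python
import numpy as np


def db_to_linear(rtm_db, amplitude=True):
    """Undo log compression of a dB-scaled map.

    Assumption: amplitude maps were compressed as 20*log10(x);
    pass amplitude=False for a 10*log10(x) power convention.
    """
    scale = 20.0 if amplitude else 10.0
    return np.power(10.0, np.asarray(rtm_db, dtype=np.float64) / scale)


def cadence_velocity_diagram(rtm_db, amplitude=True):
    """Windowed FFT over time for each range bin of a (range_bins, time_steps) map.

    Hypothetical helper: returns the magnitude spectrum with shape
    (range_bins, time_steps // 2 + 1).
    """
    lin = db_to_linear(rtm_db, amplitude)
    # Remove the static (DC) component of every range bin.
    lin = lin - lin.mean(axis=1, keepdims=True)
    # Taper each time series to reduce spectral leakage.
    window = np.hanning(lin.shape[1])
    spectrum = np.fft.rfft(lin * window, axis=1)
    return np.abs(spectrum)


if __name__ == "__main__":
    # Synthetic example: one range bin whose amplitude oscillates 8 times per frame.
    t = np.arange(64)
    lin = np.ones((4, 64))
    lin[2] = 1.0 + 0.5 * np.cos(2 * np.pi * 8 * t / 64)
    rtm_db = 20.0 * np.log10(lin)  # simulate the log-compressed input
    cvd = cadence_velocity_diagram(rtm_db)
    print(cvd.shape, int(np.argmax(cvd[2])))
```

Because the logarithm is nonlinear, running the same FFT directly on `rtm_db` rather than on the linearized values would place additional energy at multiples of the modulation frequency, which is the kind of harmonic artifact the abstract refers to.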