🤖 AI Summary
Addressing the challenge of simultaneously achieving low latency and robustness for online continuous gesture recognition in dynamic environments, this paper proposes an end-to-end online recognition framework tailored to real-time skeletal streams. Methodologically, it introduces a lightweight online architecture that integrates a Spatial Graph Convolutional Network (S-GCN) with a Transformer-based Graph Encoder (TGE), augmented by a continual learning mechanism to mitigate data distribution shift. Evaluated on the SHREC'21 benchmark, the approach achieves state-of-the-art accuracy while significantly reducing false positive rates. Unlike conventional segment-based recognition paradigms, the framework enables high-accuracy, low-latency, and adaptive recognition of continuous gesture streams, which the authors present as the first such solution. This work establishes a paradigm for human-robot collaboration and assistive technologies, balancing real-time responsiveness with strong generalization capability.
📝 Abstract
Online continuous action recognition has emerged as a critical research area, driven by real-world applications such as human-computer interaction, healthcare, and robotics. Among the available modalities, skeleton-based approaches have gained significant popularity, as they capture 3D motion over time while remaining robust to environmental variations. However, most existing works focus on segment-based recognition, making them unsuitable for real-time, continuous recognition scenarios. In this paper, we propose a novel online recognition system designed for real-time skeleton sequence streaming. Our approach leverages a hybrid architecture combining a Spatial Graph Convolutional Network (S-GCN) for spatial feature extraction and a Transformer-based Graph Encoder (TGE) for capturing temporal dependencies across frames. Additionally, we introduce a continual learning mechanism that adapts the model to evolving data distributions, ensuring robust recognition in dynamic environments. We evaluate our method on the SHREC'21 benchmark, demonstrating superior performance in online hand gesture recognition. Our approach not only achieves state-of-the-art accuracy but also significantly reduces false positive rates, making it a compelling solution for real-time applications. The proposed system can be seamlessly integrated into domains such as human-robot collaboration and assistive technologies, where natural and intuitive interaction is crucial.
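To make the S-GCN stage concrete, the sketch below implements a single spatial graph-convolution layer over the joints of one skeleton frame, in the spirit of the architecture the abstract describes. The 5-joint topology, layer sizes, and symmetric normalization are illustrative assumptions, not the authors' implementation; in the full system such per-frame features would then feed the temporal Transformer encoder.

```python
# Hedged sketch of one spatial graph-convolution (S-GCN-style) layer.
# Joint graph and dimensions are hypothetical, chosen for illustration.
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, the standard GCN propagation matrix."""
    A = np.eye(num_joints)                      # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                 # undirected bone edges
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def sgcn_layer(X, A_hat, W):
    """Aggregate neighboring joint features, project, apply ReLU.
    X: (num_joints, in_dim) per-frame joint coordinates/features."""
    return np.maximum(A_hat @ X @ W, 0.0)

# Toy 5-joint chain (hypothetical hand topology), 3-D joint coordinates.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_hat = normalized_adjacency(edges, num_joints=5)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))    # one frame of the skeleton stream
W = rng.standard_normal((3, 16))   # learnable projection (random here)
H = sgcn_layer(X, A_hat, W)        # per-joint spatial features, shape (5, 16)
```

In a streaming setting, a layer like this runs once per incoming frame, so each frame's joint features are available to the temporal encoder with no segment-level buffering.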