MixSignGraph: A Sign Sequence is Worth Mixed Graphs of Nodes

📅 2025-04-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the limitations of CNN backbones in modeling regional collaboration and multi-granular temporal dynamics in sign language, this paper proposes a hybrid graph-structured sign language representation framework. It introduces three complementary graph modules, the Local Sign Graph (LSG), Temporal Sign Graph (TSG), and Hierarchical Sign Graph (HSG), to explicitly capture interactions between sign-related regions and their spatiotemporal evolution. The authors further design a text-guided CTC pre-training paradigm (TCP) that automatically generates pseudo-morpheme labels from text sequences, eliminating the reliance on costly manual morpheme annotations. The work establishes the first triple-graph joint modeling framework for sign language recognition that requires no auxiliary modalities or human-labeled morphemes. Evaluated on five public benchmarks, the method consistently outperforms state-of-the-art approaches, achieving significant improvements across multiple metrics.

📝 Abstract
Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., left hand region and right hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction, i.e., the Local Sign Graph (LSG) module, the Temporal Sign Graph (TSG) module and the Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, i.e., focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on five current public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
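The graph modules all operate on node features attached to sign-related regions. As a rough illustration of the intra-frame (LSG-style) idea only, and not the paper's implementation, one aggregation step over a graph of regions might look like the sketch below; the region names, the fully-connected adjacency, and the plain mean pooling are all simplifying assumptions:

```python
# Hedged sketch of one intra-frame message-passing step over sign-related
# regions. Region names and the fully-connected adjacency are illustrative
# assumptions, not the paper's actual graph construction.

REGIONS = ["left_hand", "right_hand", "face", "body"]

def lsg_step(node_feats, adjacency):
    """One spatial aggregation step: each region's new feature is the
    mean of its own feature and its neighbours' features."""
    updated = {}
    for region, feat in node_feats.items():
        pooled = list(feat)  # copy own feature
        for neighbour in adjacency[region]:
            for i, value in enumerate(node_feats[neighbour]):
                pooled[i] += value
        k = 1 + len(adjacency[region])
        updated[region] = [v / k for v in pooled]
    return updated

# Fully-connected graph over the four regions (assumption).
adj = {r: [s for s in REGIONS if s != r] for r in REGIONS}
feats = {"left_hand": [1.0, 0.0], "right_hand": [0.0, 1.0],
         "face": [0.0, 0.0], "body": [1.0, 1.0]}
out = lsg_step(feats, adj)
```

With a fully-connected graph every region converges toward the global mean after one step; the paper's TSG and HSG modules extend this idea across adjacent frames and across feature-map granularities, respectively.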
Problem

Research questions and friction points this paper is trying to address.

Capturing sign-related spatial and temporal features effectively
Modeling collaboration between different sign language regions
Improving performance without gloss annotations via pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

MixSignGraph represents sign sequences as mixed graphs
Three graph modules (LSG, TSG, HSG) extract spatial, temporal and hierarchical features
Text-driven CTC Pre-training generates pseudo gloss labels
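The TCP idea is to derive pseudo gloss labels directly from text labels so that CTC pre-training needs no manual gloss annotation. The paper does not spell out its recipe here, so the sketch below is a naive stand-in: the tokenisation, the stopword list, and the upper-case gloss convention are all illustrative assumptions:

```python
# Naive sketch of deriving a pseudo-gloss sequence from a text label for
# CTC pre-training. The tokenisation, stopword list, and upper-casing
# convention are illustrative assumptions, not the paper's TCP recipe.
import string

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}  # assumption

def pseudo_glosses(text):
    tokens = text.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    # Keep content words and render them in gloss-style upper case.
    return [t.upper() for t in tokens if t not in STOPWORDS]

labels = pseudo_glosses("The weather is cold in the north.")
# -> ['WEATHER', 'COLD', 'IN', 'NORTH']
```

Pseudo labels of this kind are noisy, which is why the paper frames TCP as a pre-training signal rather than a replacement for real gloss supervision.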
Shiwei Gan
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Yafeng Yin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Zhiwei Jiang
Nanjing University
Natural Language Processing
Hongkai Wen
University of Warwick
Machine Learning · ML/AI Systems · Cyber-Physical Systems
Lei Xie
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China