StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

career value

187K/year

🤖 AI Summary

Sign language transition generation aims to synthesize temporally coherent and semantically accurate continuous videos from discrete sign language segments; however, existing approaches predominantly rely on naive concatenation, resulting in spatiotemporal discontinuities and semantic distortions. To address this, we propose the first spatial-temporal graph-conditional diffusion framework for sign language transitions. Our method introduces a Sign-GCN module that explicitly models sign-specific spatiotemporal dependencies, jointly leveraging structure-aware skeletal representations and noise-driven transitional frame generation. It integrates graph convolutional networks, conditional diffusion modeling, and skeletal sequence encoders-decoders. Extensive experiments on PHOENIX14T, USTC-CSL100, and USTC-SLR500 demonstrate substantial improvements: FID decreases by 18.7%, while human evaluations show 23.5% gains in perceived naturalness and semantic fidelity.

Technology Category

Application Category

📝 Abstract

Sign language transition generation seeks to convert discrete sign language segments into continuous sign videos by synthesizing smooth transitions. However,most existing methods merely concatenate isolated signs, resulting in poor visual coherence and semantic accuracy in the generated videos. Unlike textual languages,sign language is inherently rich in spatial-temporal cues, making it more complex to model. To address this,we propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs by capturing the unique spatial-temporal dependencies of sign language. Specifically, we first train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. Next, we optimize a diffusion denoiser conditioned on the representations learned by the pre-trained encoder, which is tasked with predicting transition frames from noise. Additionally, we design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features. Extensive experiments conducted on the PHOENIX14T, USTC-CSL100,and USTC-SLR500 datasets demonstrate the superior performance of our method.

Problem

Research questions and friction points this paper is trying to address.

Generating smooth transitions between discrete sign language segments

Improving visual coherence and semantic accuracy in sign videos

Modeling spatial-temporal dependencies in sign language effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based conditional diffusion framework

Structure-aware spatial-temporal skeleton encoding

Sign-GCN module for feature modeling

🔎 Similar Papers

An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs