🤖 AI Summary
Sign language transition generation aims to synthesize temporally coherent and semantically accurate continuous videos from discrete sign language segments; however, existing approaches predominantly rely on naive concatenation, resulting in spatiotemporal discontinuities and semantic distortions. To address this, we propose the first spatial-temporal graph-conditional diffusion framework for sign language transitions. Our method introduces a Sign-GCN module that explicitly models sign-specific spatiotemporal dependencies, jointly leveraging structure-aware skeletal representations and noise-driven transitional frame generation. It integrates graph convolutional networks, conditional diffusion modeling, and skeletal sequence encoders-decoders. Extensive experiments on PHOENIX14T, USTC-CSL100, and USTC-SLR500 demonstrate substantial improvements: FID decreases by 18.7%, while human evaluations show 23.5% gains in perceived naturalness and semantic fidelity.
📝 Abstract
Sign language transition generation seeks to convert discrete sign language segments into continuous sign videos by synthesizing smooth transitions. However,most existing methods merely concatenate isolated signs, resulting in poor visual coherence and semantic accuracy in the generated videos. Unlike textual languages,sign language is inherently rich in spatial-temporal cues, making it more complex to model. To address this,we propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs by capturing the unique spatial-temporal dependencies of sign language. Specifically, we first train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. Next, we optimize a diffusion denoiser conditioned on the representations learned by the pre-trained encoder, which is tasked with predicting transition frames from noise. Additionally, we design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features. Extensive experiments conducted on the PHOENIX14T, USTC-CSL100,and USTC-SLR500 datasets demonstrate the superior performance of our method.