SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current sign language generation evaluation relies on text-based back-translation, which fails to model multimodal features such as facial expressions, spatial grammar, and prosody, and suffers from ambiguous error attribution. To address these limitations, we propose SiLVERScore, the first semantically-aware joint embedding metric for sign language generation evaluation. It constructs a shared semantic space for sign videos and their corresponding texts via deep embedding learning, contrastive learning, and multimodal alignment, enabling direct measurement of semantic and prosodic similarity between generated and reference signs. SiLVERScore demonstrates strong robustness and cross-dataset generalization. On PHOENIX-14T and CSL-Daily, it achieves an ROC AUC of 0.99 and an overlap of less than 7% between the score distributions of correct and random samples, significantly outperforming conventional back-translation metrics.
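The summary mentions contrastive learning for aligning sign videos and texts in a shared space. As an illustrative sketch (not the paper's exact objective), a symmetric InfoNCE-style loss over a batch of paired video/text embeddings, where matched pairs lie on the diagonal of the similarity matrix, might look like:

```python
import numpy as np

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Rows of `video_embs` and `text_embs` are assumed to be matched
    pairs; the loss pulls matched pairs together and pushes apart
    mismatched ones in both directions (video->text and text->video).
    """
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # pairwise cosine similarities, scaled
    n = logits.shape[0]

    def cross_entropy_on_diagonal(m):
        # Numerically stable log-softmax over each row, then take the
        # negative log-probability of the matched (diagonal) entry.
        m = m - m.max(axis=1, keepdims=True)
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Once trained with such an objective, semantically matching video/text pairs score high cosine similarity while mismatched pairs score low, which is what makes the joint space usable as an evaluation metric.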

📝 Abstract
Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating sign language generation via back-translation introduces ambiguity
Existing metrics fail to capture multimodal nature of sign language
Hard to pinpoint errors from generation versus translation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically-aware embeddings for joint evaluation
Directly assesses sign language in embedding space
Robust to semantic and prosodic variations
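Scoring directly in a joint embedding space, as described above, can be sketched as a cosine similarity between a sign-video embedding and a text embedding. The encoders themselves are not shown here; the toy vectors below are hypothetical stand-ins for encoder outputs:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale a vector to unit length (guarding against zero norm)."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x) + eps)

def joint_embedding_score(video_emb, text_emb):
    """Cosine similarity between a sign-video embedding and a text
    embedding in a shared semantic space; higher means more similar."""
    return float(np.dot(l2_normalize(video_emb), l2_normalize(text_emb)))

# Toy vectors standing in for encoder outputs.
reference = np.array([0.9, 0.1, 0.0])
good_gen = np.array([0.85, 0.15, 0.05])  # semantically close generation
random_gen = np.array([-0.2, 0.9, 0.3])  # unrelated generation
```

Because both modalities live in one space, the score attributes low similarity directly to the generated sign, avoiding the error-attribution ambiguity of a two-step back-translation pipeline.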