SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing sign language datasets: they rely predominantly on raw video–text alignment and lack a unified pose-native interface suited to open-world scenarios, which hinders modern pose-driven sign language recognition and generation. To bridge this gap, we introduce SignVerse-2M, a large-scale multilingual pose-native dataset of about two million clips spanning more than 25 sign languages, derived from real-world videos and processed through a standardized DWPose pipeline that extracts 2D keypoint sequences. The dataset is positioned as a real-scenario-compatible resource that interfaces directly with mainstream pose-driven frameworks, moving beyond laboratory-bound recordings by providing appearance-agnostic pose representations and a unified basis for cross-lingual modeling. Alongside the dataset, we release the data construction pipeline, formal task definitions, and a SignDW Transformer baseline, which demonstrate the feasibility of multilingual pose-space modeling and compatibility with modern pose-driven pipelines.

📝 Abstract

Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks, for two reasons: (1) RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models; (2) recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
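
To make the idea of a pose-native clip concrete, the following is a minimal Python sketch of how a 2D keypoint sequence might be held in memory and made less sensitive to camera framing. It is not the authors' pipeline or file format, neither of which is specified here. It assumes the 133-keypoint COCO-WholeBody layout that DWPose predicts (17 body, 6 foot, 68 face, 42 hand keypoints, with the shoulders at body indices 5 and 6). The `PoseClip` container, its `language` and `fps` fields, the `min_conf` threshold, and the shoulder-based normalization are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)

# Assumed layout: COCO-WholeBody, 17 body + 6 feet + 68 face + 42 hands.
NUM_KEYPOINTS = 133
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6  # COCO body keypoint indices


@dataclass
class PoseClip:
    """Hypothetical container for one clip; not the dataset's real schema."""
    language: str                  # illustrative language tag, e.g. "ase"
    fps: float
    frames: List[List[Keypoint]]   # one list of 133 keypoints per frame


def normalize_frame(frame: List[Keypoint],
                    min_conf: float = 0.3) -> Optional[List[Keypoint]]:
    """Center a frame on the shoulder midpoint and scale by shoulder width.

    Returns None when the shoulders are not detected reliably, since the
    normalization would then be undefined.
    """
    if len(frame) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} keypoints, got {len(frame)}")
    ls, rs = frame[LEFT_SHOULDER], frame[RIGHT_SHOULDER]
    if ls[2] < min_conf or rs[2] < min_conf:
        return None
    cx, cy = (ls[0] + rs[0]) / 2.0, (ls[1] + rs[1]) / 2.0
    width = math.hypot(ls[0] - rs[0], ls[1] - rs[1])
    if width == 0.0:
        return None
    return [((x - cx) / width, (y - cy) / width, c) for x, y, c in frame]


def normalize_clip(clip: PoseClip) -> PoseClip:
    """Normalize every frame and drop frames without reliable shoulders."""
    kept = []
    for frame in clip.frames:
        normalized = normalize_frame(frame)
        if normalized is not None:
            kept.append(normalized)
    return PoseClip(language=clip.language, fps=clip.fps, frames=kept)
```

Under these assumptions, two recordings of the same signer at different distances from the camera map to similar coordinates, which is one simple way to reduce the appearance and framing variation that the abstract attributes to real-world video.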

Problem

Research questions and friction points this paper is trying to address.

sign language

pose-native

open-world

multilingual

video-to-pose

Innovation

Methods, ideas, or system contributions that make the work stand out.

pose-native

sign language dataset

DWPose