Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the weak cross-view generalization of models trained on front-view data in cross-view isolated sign language recognition (CV-ISLR), a problem caused by substantial camera viewpoint variations. To tackle this challenge, we propose the first ensemble learning framework specifically designed for CV-ISLR. Methodologically, we integrate multi-view feature consistency modeling with the multi-dimensional spatiotemporal representations of the Video Swin Transformer, constructing a unified RGB and RGB-D multimodal joint learning architecture. Furthermore, we introduce a dual-path ensemble strategy operating at both the model level and the feature level to significantly enhance cross-view robustness. Evaluated on the WWW 2025 CV-ISLR Challenge, our approach achieves third place in both the RGB and RGB-D tracks, demonstrating its effectiveness and practical applicability.
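The dual-path ensemble strategy described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, tensor shapes, and the choice of logit averaging (model level) and feature concatenation (feature level) are assumptions for the sake of the example.

```python
import numpy as np

def model_level_ensemble(logits_list):
    # Model-level path: average the class logits produced by
    # independently trained models before taking the argmax.
    return np.mean(np.stack(logits_list, axis=0), axis=0)

def feature_level_ensemble(features_list):
    # Feature-level path: concatenate per-branch (e.g. per-view or
    # per-modality) features so a shared classifier sees all of them.
    return np.concatenate(features_list, axis=-1)

# Toy example: 3 models, a batch of 2 clips, 10 sign classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 10)) for _ in range(3)]
fused_logits = model_level_ensemble(logits)
preds = fused_logits.argmax(axis=-1)

# Two 256-dim feature branches fused into one 512-dim representation.
feats = [rng.normal(size=(2, 256)) for _ in range(2)]
fused_feat = feature_level_ensemble(feats)
print(fused_logits.shape, fused_feat.shape)  # (2, 10) (2, 512)
```

In practice the fused feature would feed a final classification head, and the logit averaging could be weighted by per-model validation accuracy.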

📝 Abstract
In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR): existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must understand gestures from multiple angles, which makes cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Our solution ranked 3rd in both the RGB-based and RGB-D-based ISLR tracks, demonstrating its effectiveness in handling the challenges of cross-view recognition. The code is available at: https://github.com/Jiafei127/CV_ISLR_WWW2025.
Problem

Research questions and friction points this paper is trying to address.

Addresses cross-view isolated sign language recognition
Enhances model robustness with ensemble learning
Leverages multi-dimensional Video Swin Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble learning robustness
Multi-angle gesture understanding
Video Swin Transformer model
Fei Wang
Hefei University of Technology, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Kun Li
CCAI, Zhejiang University, Hangzhou, China
Yiqi Nie
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Anhui University, Hefei, China
Zhangling Duan
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Peng Zou
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Zhiliang Wu
Research Scientist, Siemens Technology
Representation learning · Machine learning · Gaussian Processes · Healthcare
Yuwei Wang
Anhui Agricultural University, Hefei, China
Yanyan Wei
Hefei University of Technology (HFUT)
Robust Image Perception · LLM · AI Agent