GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

📅 2025-05-30
🤖 AI Summary
Arabic Semantic Textual Similarity (STS) research has long suffered from the scarcity of high-quality datasets and Arabic-specific pretrained models, resulting in suboptimal semantic modeling accuracy and unreliable evaluation. To address this, we propose the first Matryoshka fine-grained representation learning framework tailored for Arabic, integrating a hybrid loss function that combines Natural Language Inference (NLI) triplet supervision with contrastive learning, alongside an Arabic-adapted pretraining–fine-tuning architecture. Evaluated on the STS subtask of the Massive Text Embedding Benchmark (MTEB), our method achieves state-of-the-art performance—outperforming general-purpose large language models (e.g., OpenAI’s embeddings) by 20–25%. It significantly enhances modeling of Arabic’s rich morphological variation and semantic ambiguity. This work establishes a reusable methodological paradigm for STS research in low-resource languages.

📝 Abstract
Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted both accurate evaluation and the advancement of semantic similarity research for Arabic text. This paper introduces the General Arabic Text Embedding (GATE) models, which achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI's embedding models, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
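The core Matryoshka idea can be sketched as follows: the training loss is evaluated not only on the full embedding but also on nested prefixes of it, so truncated vectors remain useful embeddings on their own. The sketch below is a minimal illustration with an assumed triplet objective and illustrative dimensions, not the paper's exact configuration.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Hinge loss: the positive should be closer to the anchor
    # than the negative, by at least `margin` (assumed value).
    return max(0.0, cosine_sim(anchor, negative) - cosine_sim(anchor, positive) + margin)

def matryoshka_loss(anchor, positive, negative, dims=(64, 128, 256, 768)):
    # Sum the base loss over nested prefixes of the embedding, so every
    # truncation of the vector is trained to be a usable embedding.
    # `dims` is an illustrative set of nested sizes.
    return sum(triplet_loss(anchor[:d], positive[:d], negative[:d]) for d in dims)

rng = np.random.default_rng(0)
anchor, positive, negative = (rng.standard_normal(768) for _ in range(3))
print(matryoshka_loss(anchor, positive, negative))
```

At inference time, the same trained model can then serve 64-, 128-, 256-, or 768-dimensional embeddings simply by truncation, trading accuracy for storage and speed.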
Problem

Research questions and friction points this paper is trying to address.

Enhancing Arabic semantic textual similarity (STS) with limited resources
Improving Arabic text embedding using hybrid loss training
Addressing lack of high-quality Arabic datasets for NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matryoshka Representation Learning for embeddings
Hybrid loss training with Arabic triplets
Enhanced semantic understanding for Arabic
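A hybrid loss of the kind described above can be sketched as a weighted mix of a softmax contrastive (InfoNCE-style) term over in-batch negatives and a triplet margin term over NLI triplets. The weighting `alpha`, the temperature, and the margin below are illustrative assumptions, not the paper's reported hyperparameters.

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.05):
    # Contrastive term: each anchor's own positive competes against
    # the other positives in the batch, which act as negatives.
    a, p = normalize(anchors), normalize(positives)
    logits = a @ p.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def triplet_margin(anchors, positives, negatives, margin=0.5):
    # Triplet term: entailment pairs pulled together, contradiction
    # pairs pushed apart by at least `margin`.
    a, p, n = normalize(anchors), normalize(positives), normalize(negatives)
    pos_sim = np.sum(a * p, axis=1)
    neg_sim = np.sum(a * n, axis=1)
    return np.mean(np.maximum(0.0, neg_sim - pos_sim + margin))

def hybrid_loss(anchors, positives, negatives, alpha=0.5):
    # Weighted combination of the two objectives (alpha is assumed).
    return alpha * info_nce(anchors, positives) \
        + (1 - alpha) * triplet_margin(anchors, positives, negatives)

rng = np.random.default_rng(0)
anchors = rng.standard_normal((8, 256))
positives = anchors + 0.1 * rng.standard_normal((8, 256))
negatives = rng.standard_normal((8, 256))
print(hybrid_loss(anchors, positives, negatives))
```

The triplet term supplies explicit hard supervision from NLI labels, while the contrastive term exploits every other example in the batch as a free negative; combining them is one way to get fine-grained supervision from a small labeled dataset.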