🤖 AI Summary
This work addresses the challenge of obtaining high-quality sentence representations for low-resource languages like Vietnamese, where limited annotated data hinders natural language understanding (NLU) performance. To overcome this bottleneck, the authors propose ViCLSR, the first framework that integrates supervised contrastive learning with natural language inference (NLI) for Vietnamese sentence embeddings. By adapting existing Vietnamese NLI datasets to support contrastive learning and fine-tuning pretrained language models accordingly, ViCLSR achieves substantial improvements over the strong PhoBERT baseline across five Vietnamese NLU benchmarks, with gains in F1 score or accuracy ranging from 4.33% to 9.02%. The results demonstrate that the proposed approach effectively mitigates data scarcity in low-resource settings while enhancing sentence representation quality.
📄 Abstract
High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets, namely ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes to advance natural language processing research.
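To make the core idea concrete: NLI-supervised contrastive learning (as popularized by supervised SimCSE, which the abstract's setup resembles) treats a premise's entailed hypothesis as a positive and its contradicted hypothesis as a hard negative, training embeddings with an InfoNCE-style loss over in-batch pairs. The sketch below is a minimal NumPy illustration of that loss, not the authors' released code; the function name, temperature value, and input layout are assumptions for illustration.

```python
import numpy as np

def nli_contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """InfoNCE-style loss over NLI triplets (illustrative sketch).

    anchor:        (B, d) embeddings of premises
    positive:      (B, d) embeddings of entailed hypotheses
    hard_negative: (B, d) embeddings of contradicted hypotheses
    Each anchor's target is its own positive; all other in-batch
    positives and all hard negatives act as negatives.
    """
    def unit(x):  # L2-normalize rows so dot products are cosine similarities
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(hard_negative)

    # (B, 2B) similarity logits: columns 0..B-1 are positives,
    # columns B..2B-1 are hard negatives; the true class for row i is i.
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Numerically stable log-softmax cross-entropy on the diagonal labels.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = anchor.shape[0]
    return -log_probs[np.arange(batch), np.arange(batch)].mean()
```

Minimizing this loss pulls each premise toward its entailed hypothesis while pushing it away from contradictions and other in-batch sentences, which is the mechanism by which the adapted Vietnamese NLI data sharpens the sentence space.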