ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vietnamese lacks high-quality models and evaluation resources supporting fine-grained semantic understanding—particularly word sense disambiguation (WSD) and context-aware representation learning. To address this, we introduce ViConWSD, the first large-scale synthetic WSD benchmark for Vietnamese, and propose a contextualized word embedding model that jointly integrates contrastive learning (SimCLR) with gloss-based knowledge distillation, enabling unified modeling of discrete sense distinctions and continuous semantic similarity. Our approach fine-tunes a pretrained language model via context-gloss alignment to refine semantic representations. Experiments show state-of-the-art performance: 0.87 F1 on WSD, and 0.88 average precision (AP) and 0.60 Spearman’s rho on ViCon and ViSim-400, respectively—substantially outperforming existing baselines. Our core contributions are (1) the first fine-grained semantic evaluation benchmark for Vietnamese, and (2) a novel context-aware framework for joint modeling of polysemous words.
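The paper does not include pseudocode on this page; as a rough illustration of the SimCLR-style context-gloss alignment described above, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss, where each context embedding is pulled toward its matching gloss embedding and pushed away from in-batch negatives. The function name and temperature value are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def context_gloss_info_nce(context_emb, gloss_emb, temperature=0.07):
    """InfoNCE-style context-gloss alignment loss (illustrative sketch).

    context_emb: (N, d) embeddings of target words in context.
    gloss_emb:   (N, d) embeddings of the matching sense glosses;
                 row i of gloss_emb is the positive for row i of
                 context_emb, and all other rows act as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    g = gloss_emb / np.linalg.norm(gloss_emb, axis=1, keepdims=True)
    logits = c @ g.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_prob))
```

A correctly aligned batch (each context paired with its own gloss) should yield a much lower loss than a mismatched one, which is what drives the embeddings of a polysemous word toward the gloss of its in-context sense.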

📝 Abstract
Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT
Problem

Research questions and friction points this paper is trying to address.

Developing contextualized Vietnamese embeddings for polysemous words
Creating the first large-scale Vietnamese dataset for fine-grained semantic evaluation
Addressing the lack of robust semantic models for low-resource Vietnamese
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates contrastive learning (SimCLR) with gloss-based knowledge distillation
Creates the first large-scale synthetic Vietnamese semantic evaluation dataset (ViConWSD)
Outperforms strong baselines on word sense disambiguation
Khang T. Huynh
University of Information Technology, Ho Chi Minh City, Vietnam
Dung H. Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam
Binh T. Nguyen
VinUniversity
statistics, optimal transport