🤖 AI Summary
To address the performance bottleneck and computational overhead in BERT-based word sense disambiguation (WSD), which stem from imbalanced token-level (local) and sequence-level (global) semantic representation and from training on every candidate sense of each target word, this paper proposes PolyBERT, a poly-encoder BERT-based model that fuses local and global semantics with multi-head attention and is trained with Batch Contrastive Learning (BCL). BCL treats the correct senses of the other target words in the same batch as negative samples for the current target word, providing sense-discriminative supervision while eliminating redundant training inputs. On standard WSD benchmarks, PolyBERT outperforms strong baselines such as GlossBERT and BEM by 2% in F1-score, and BCL reduces GPU hours by 37.6% compared with the same model trained without it.
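To make the local/global fusion idea concrete, here is a minimal PyTorch sketch of one way a multi-head attention layer could let the sequence-level ([CLS]) representation attend over the token-level states and blend the two views. The class name `PolyFusion`, the averaging step, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of fusing token-level (local) and sequence-level (global)
# semantics with multi-head attention; details are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PolyFusion(nn.Module):
    def __init__(self, hidden: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, token_states: torch.Tensor, cls_state: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden)  -- local, per-token semantics
        # cls_state:    (batch, hidden)           -- global, sequence-level semantics
        query = cls_state.unsqueeze(1)                    # (batch, 1, hidden)
        fused, _ = self.attn(query, token_states, token_states)
        # Blend the attended local view with the global view (one possible choice).
        return (fused.squeeze(1) + cls_state) / 2         # (batch, hidden)

# Toy usage with random tensors standing in for BERT outputs.
fusion = PolyFusion()
tokens = torch.randn(4, 128, 768)
cls = torch.randn(4, 768)
rep = fusion(tokens, cls)   # (4, 768) fused representation
```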
📝 Abstract
Mainstream Word Sense Disambiguation (WSD) approaches have employed BERT to extract semantics from both context and definitions of senses to determine the most suitable sense of a target word, achieving notable performance. However, there are two limitations in these approaches. First, previous studies failed to balance the representation of token-level (local) and sequence-level (global) semantics during feature extraction, leading to insufficient semantic representation and a performance bottleneck. Second, these approaches incorporated all possible senses of each target word during the training phase, leading to unnecessary computational costs. To overcome these limitations, this paper introduces a poly-encoder BERT-based model with batch contrastive learning for WSD, named PolyBERT. Compared with previous WSD methods, PolyBERT has two improvements: (1) A poly-encoder with a multi-head attention mechanism is utilized to fuse token-level (local) and sequence-level (global) semantics, rather than focusing on just one. This approach enriches semantic representation by balancing local and global semantics. (2) To avoid redundant training inputs, Batch Contrastive Learning (BCL) is introduced. BCL utilizes the correct senses of other target words in the same batch as negative samples for the current target word, which reduces training inputs and computational cost. The experimental results demonstrate that PolyBERT outperforms baseline WSD methods such as Huang's GlossBERT and Blevins's BEM by 2% in F1-score. In addition, PolyBERT with BCL reduces GPU hours by 37.6% compared with PolyBERT without BCL.
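The BCL objective described in point (2) amounts to an in-batch contrastive loss: each target word's fused representation should score highest against its own correct sense, with the correct senses of the other targets in the batch serving as negatives. The sketch below illustrates this setup in PyTorch; the function name, normalization, and temperature are assumptions for illustration and may differ from the paper's exact formulation.

```python
# Minimal sketch of Batch Contrastive Learning (BCL): the diagonal pairs
# (word i, its correct sense i) are positives, all off-diagonal senses in the
# batch are negatives. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def batch_contrastive_loss(word_reps: torch.Tensor,
                           sense_reps: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    # word_reps:  (batch, hidden) representations of the target words
    # sense_reps: (batch, hidden) encodings of each target's correct sense
    word_reps = F.normalize(word_reps, dim=-1)
    sense_reps = F.normalize(sense_reps, dim=-1)
    logits = word_reps @ sense_reps.t() / temperature     # (batch, batch) similarity matrix
    labels = torch.arange(word_reps.size(0), device=word_reps.device)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 target words and their gold-sense encodings.
loss = batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```

Because every batch item doubles as a negative for every other item, no extra sense candidates need to be encoded per target, which is consistent with the reported reduction in training inputs and GPU hours.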