🤖 AI Summary
This paper addresses domain adaptation for sentence embeddings in Japanese, a language where large-scale labeled datasets are scarce. It proposes SDJC, a self-supervised framework that uses a data generator to produce sentences sharing the syntactic structure of sentences in an unlabeled domain-specific corpus while conveying different semantic meanings; these generated sentences then drive contrastive learning that adapts a pre-trained backbone model to discriminate sentences in the target domain. Because no Japanese benchmark exists for selecting a backbone model and an adaptation method, the paper also constructs a comprehensive Japanese Semantic Textual Similarity (STS) benchmark by combining datasets machine-translated from English with existing Japanese datasets. Experiments on two domain-specific downstream tasks validate the effectiveness of SDJC and the usefulness of the constructed benchmark. The datasets, code, and backbone models adapted by SDJC are publicly released.
📝 Abstract
Several backbone models pre-trained on general-domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation, which adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which utilizes a data generator to generate sentences that have the same syntactic structure as a sentence in an unlabeled domain-specific corpus but convey different semantic meanings. The generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC, such as the backbone model and the method to adapt it, need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, code, and backbone models adapted by SDJC are available in our GitHub repository: https://github.com/ccilab-doshisha/SDJC.
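To make the contrastive-learning step concrete, below is a minimal sketch of an InfoNCE-style objective of the kind SDJC's training could use: an anchor sentence embedding is pulled toward a positive view and pushed away from embeddings of generated sentences that share the anchor's syntax but differ in meaning. The function name, NumPy implementation, and temperature value are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss over sentence embeddings.

    anchor, positive: (d,) embedding vectors of two views of a sentence.
    negatives: (k, d) embeddings of generated sentences that keep the
    anchor's syntactic structure but convey different meanings
    (hard negatives). Returns a non-negative scalar loss.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_sim = cos(anchor, positive) / temperature
    neg_sims = np.array([cos(anchor, n) for n in negatives]) / temperature
    logits = np.concatenate([[pos_sim], neg_sims])
    # Negative log-probability of the positive pair under a softmax
    # over the positive and all hard negatives.
    return -(pos_sim - np.log(np.exp(logits).sum()))

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
d = 8
anchor = rng.normal(size=d)
positive = anchor + 0.01 * rng.normal(size=d)   # near-duplicate view
negatives = rng.normal(size=(4, d))             # different-meaning sentences
loss = info_nce_loss(anchor, positive, negatives)
```

Minimizing this loss adapts the backbone so that, within the target domain, sentences with different meanings map to distant embeddings even when their surface syntax is nearly identical.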