Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses domain adaptation of sentence embeddings for low-resource languages—specifically Japanese—where large-scale labeled data are scarce. The authors propose SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which uses a data generator to produce sentences that share the syntactic structure of sentences in an unlabeled domain-specific corpus but convey different meanings, providing hard negatives for unsupervised contrastive fine-tuning of a backbone model. The key contributions are threefold: (1) a syntax-preserving synthetic sentence generation mechanism for boosting contrastive learning; (2) a comprehensive Japanese STS (Semantic Textual Similarity) benchmark, built by combining datasets machine-translated from English with existing Japanese datasets, to guide the selection of backbone models and adaptation methods; and (3) experimental validation on two domain-specific downstream tasks. The benchmark datasets, training code, and adapted Japanese sentence embedding models are publicly released.

📝 Abstract
Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which utilizes a data generator to generate sentences that have the same syntactic structure as a sentence in an unlabeled specific domain corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC, such as the backbone model and the method to adapt it, need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, code, and backbone models adapted by SDJC are available on our GitHub repository https://github.com/ccilab-doshisha/SDJC.
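The abstract describes adapting a backbone model with contrastive learning so that it discriminates between a corpus sentence and generated sentences that share its syntax but differ in meaning. A minimal sketch of such a contrastive objective is shown below, assuming an InfoNCE-style loss with in-batch negatives (the paper's exact loss and batching may differ; `info_nce_loss` and its inputs are illustrative names, not the authors' API):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    anchors, positives: (batch, dim) arrays of sentence embeddings.
    Row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) / temperature  # (batch, batch) similarity matrix

    # Cross-entropy with the diagonal (matched pairs) as the target class.
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# Identical anchor/positive pairs should yield a much lower loss
# than randomly paired embeddings.
loss_aligned = info_nce_loss(emb, emb)
loss_random = info_nce_loss(emb, rng.normal(size=(4, 8)))
```

In SDJC's setting, the syntactically matched but semantically divergent generated sentences would supply the negatives, forcing the adapted model to separate sentences by meaning rather than surface structure.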
Problem

Research questions and friction points this paper is trying to address.

Enhance Japanese sentence embeddings via domain adaptation.
Overcome the scarcity of labeled datasets for Japanese.
Develop a benchmark dataset for Japanese semantic similarity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic sentence generation for domain adaptation
Contrastive learning enhances Japanese sentence embeddings
New Japanese STS benchmark dataset created
Zihao Chen
Graduate School of Science and Engineering, Doshisha University, 1-3, Tatara Miyakodani, Kyotanabe, 610-0394, Kyoto, Japan.
Hisashi Handa
Kindai University
Miho Ohsaki
Graduate School of Science and Engineering, Doshisha University, 1-3, Tatara Miyakodani, Kyotanabe, 610-0394, Kyoto, Japan.
Kimiaki Shirahama
Doshisha University
Multimedia retrieval, Machine learning, Data mining, Human activity recognition