🤖 AI Summary
To address the scarcity of large-scale, high-quality parallel corpora for Japanese-English patent translation, this paper constructs JaParaPat, a Japanese-English parallel corpus of more than 300 million sentence pairs drawn from unexamined patent applications published in Japan and the United States between 2000 and 2021. Document pairs are identified through DOCDB patent family information, and sentences are then aligned in two stages: a dictionary-based sentence alignment method bootstraps an initial translation model, which a translation-based sentence alignment method then uses to extract the final sentence pairs. Adding the patent-derived pairs to 22 million sentence pairs obtained from the web improves patent translation accuracy by 20 BLEU points. The corpus provides a resource for domain-adapted translation systems and bilingual analysis of patent text.
📝 Abstract
We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). Using the patent families, we extracted approximately 1.4M Japanese-English document pairs that are translations of each other. From these document pairs we extracted about 350M sentence pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. In our experiments, adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web improved the accuracy of patent translation by 20 BLEU points.
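The document-pairing step can be illustrated with a minimal sketch. It assumes the family information has already been reduced to simple `(family_id, country_code, publication_number)` tuples; this record layout, the function name `pair_documents_by_family`, and the sample identifiers are hypothetical and are not the DOCDB schema or the authors' code. The sketch pairs every Japanese publication with every U.S. publication in the same family, which is only one possible policy; the abstract does not say how families with several members per country are handled.

```python
from collections import defaultdict
from itertools import product


def pair_documents_by_family(records):
    """Pair JP and US publications that share a patent family.

    `records` is an iterable of (family_id, country_code, publication_number)
    tuples. This layout is an assumption made for illustration only.
    """
    by_family = defaultdict(lambda: {"JP": [], "US": []})
    for family_id, country, pub_no in records:
        # Ignore family members published by other offices (e.g. EP).
        if country in ("JP", "US"):
            by_family[family_id][country].append(pub_no)

    pairs = []
    for family_id in sorted(by_family):
        members = by_family[family_id]
        # Families missing either side contribute no pairs,
        # because the product with an empty list is empty.
        pairs.extend(product(sorted(members["JP"]), sorted(members["US"])))
    return pairs


# Hypothetical sample records, not real publication numbers.
records = [
    ("F1", "JP", "JP-A-0001"),
    ("F1", "US", "US-A1-0001"),
    ("F1", "EP", "EP-A1-0001"),
    ("F2", "JP", "JP-A-0002"),
    ("F3", "US", "US-A1-0003"),
]

print(pair_documents_by_family(records))
```

Only family `F1` has both a Japanese and a U.S. member, so it is the only family that yields a candidate document pair; sentence alignment would then run on each such pair.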