🤖 AI Summary
To address the underdeveloped open-source ecosystem for Japanese large language models (LLMs), this work introduces the first cross-institutional, cross-domain, full-stack open collaboration paradigm, one that keeps data curation, model training, and evaluation autonomous, transparent, and reproducible end to end. Built on the Transformer architecture, the approach incorporates Japanese-specific tokenization, rigorous cleaning of a high-quality corpus, multi-stage pretraining, and instruction fine-tuning, all backed by fully documented, reproducible training pipelines. The initiative unites more than 1,500 researchers from industry and academia and has publicly released the LLM-jp model series (e.g., LLM-jp-13b), which achieves state-of-the-art results on Japanese benchmarks including JA-MMLU and JCommonsenseQA. Its core contribution is the first high-performance, fully open, and reproducible foundational LLM ecosystem for Japanese, which now serves as the de facto standard base model in the Japanese AI community.
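As a concrete entry point, below is a minimal sketch of loading and prompting one of the released models with Hugging Face `transformers`. The hub ID `llm-jp/llm-jp-13b-v1.0` and the generation settings are assumptions for illustration, not details stated in the summary or abstract; check the project page for the exact released checkpoints.

```python
# Minimal sketch: load a released LLM-jp checkpoint and generate a completion.
# Assumes the checkpoint is published on the Hugging Face Hub under the ID
# below (an assumption; substitute the actual released model name if it differs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm-jp/llm-jp-13b-v1.0"  # assumed hub ID for the base 13B model

# The tokenizer is Japanese-specific, as described in the summary.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 13B model in memory
    device_map="auto",           # spread layers across available devices
)

# Japanese prompt: "What is a large language model?"
prompt = "大規模言語モデルとは何ですか？"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Illustrative sampling settings; tune for your use case.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this loads the base pretrained model; the instruction-tuned variants mentioned in the summary are released separately under their own hub IDs.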
📝 Abstract
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.