KORMo: Korean Open Reasoning Model for Everyone

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The scarcity of high-quality annotated data impedes large language model (LLM) development for low-resource languages. Method: We propose a fully open-source paradigm for constructing bilingual LLMs: training KORMo-10B, a 10.8B-parameter Korean-English model, from scratch on large-scale, high-quality synthetic data; empirically validating that synthetic data sustains stable, long-horizon autoregressive pretraining without model collapse; and achieving near-native Korean reasoning and discourse coherence via bilingual instruction tuning and balanced corpus coverage. Results: KORMo-10B matches or exceeds leading open-source multilingual models on Korean and English reasoning, knowledge, and instruction-following benchmarks. Crucially, all training data, code, logs, and end-to-end configurations are released, enabling the first fully reproducible, transparent, and verifiable bilingual LLM development pipeline.
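As a concrete illustration of the data recipe described above, here is a minimal sketch of weighted sampling from a bilingual pretraining mixture. The 68.74% synthetic share of the Korean portion comes from the abstract; the 50/50 Korean-English split, the source names, and the sampling scheme are illustrative assumptions, not the authors' released pipeline.

```python
import random

# Hypothetical pretraining mixture for a Korean-English bilingual model.
# The 68.74% synthetic share within the Korean portion is from the abstract;
# the overall 50/50 language split and source names are illustrative guesses.
MIXTURE = {
    "korean_synthetic": 0.6874 * 0.5,        # synthetic Korean text
    "korean_natural":   (1 - 0.6874) * 0.5,  # curated natural Korean text
    "english_web":      0.5,                 # English portion (assumed split)
}

def sample_source(rng: random.Random) -> str:
    """Pick the next document's source according to the mixture weights."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in MIXTURE}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    # Empirical draw frequencies should approximate the target mixture.
    print(counts)
```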

📝 Abstract
This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
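The abstract credits bilingual instruction tuning with enabling near-native reasoning and discourse coherence in Korean. Below is a minimal sketch of what a paired bilingual instruction-tuning record might look like; the field layout, the chat-message rendering, and the Korean/English pairing strategy are assumptions for illustration, not the released data format.

```python
from dataclasses import dataclass

@dataclass
class InstructionExample:
    language: str     # "ko" or "en"
    instruction: str
    response: str

def to_chat_format(ex: InstructionExample) -> list[dict]:
    """Render one example as generic chat messages for supervised tuning."""
    return [
        {"role": "user", "content": ex.instruction},
        {"role": "assistant", "content": ex.response},
    ]

# A hypothetical English/Korean pair covering the same task, so the model
# sees parallel instruction styles in both languages during tuning.
pair = [
    InstructionExample("en", "Explain why the sky is blue.",
                       "Sunlight scatters off air molecules (Rayleigh scattering) ..."),
    InstructionExample("ko", "하늘이 왜 파란지 설명하세요.",
                       "햇빛이 공기 분자에 의해 산란되기 때문입니다 (레일리 산란) ..."),
]

for ex in pair:
    print(ex.language, to_chat_format(ex))
```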
Problem

Research questions and friction points this paper is trying to address.

Developing a fully open bilingual Korean-English large language model
Demonstrating synthetic data stability during large-scale pretraining
Establishing a transparent framework for low-resource language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual Korean-English model trained from scratch
Primarily uses synthetic Korean data for training
Open framework for low-resource language development
👥 Authors
Minjun Kim, KAIST MLP Lab
Hyeonseok Lim, KAIST MLP Lab
Hangyeol Yoo, KAIST MLP Lab
Inho Won, KAIST MLP Lab
Seungwoo Song, KAIST MLP Lab
Minkyung Cho, KAIST MLP Lab
Junhun Yuk, KAIST MLP Lab
Changsu Choi, KAIST MLP Lab
Dongjae Shin, KAIST MLP Lab
Huige Lee, KAIST NLPCL Lab
Hoyun Song, Postdoctoral researcher, KAIST (NLP, Knowledge Integration, Domain-Specific Modeling, LLM)
Alice Oh, KAIST Computer Science (machine learning, NLP, computational social science)
Kyungtae Lim, KAIST MLP Lab