Solar Open Technical Report

📅 2026-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training high-performance large language models for low-resource languages, where severe data scarcity is the central obstacle. The authors propose a systematic approach that synthesizes 4.5 trillion tokens of high-quality, domain-specific, reinforcement-learning-oriented bilingual data; integrates a progressive curriculum strategy spanning 20 trillion tokens; and employs the efficient SnapPO reinforcement learning optimization framework to train a 102-billion-parameter bilingual mixture-of-experts (MoE) language model. The resulting model achieves, for the first time in this low-resource setting, performance on par with state-of-the-art models, with strong results across multiple English and Korean benchmarks, significantly advancing AI capabilities for under-resourced languages.

📝 Abstract
We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.
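The progressive curriculum described in the abstract can be pictured as a schedule that gradually shifts the training data mixture as more of the 20T tokens are consumed. The sketch below is a minimal illustration of that idea only; the domain names, phase weights, and linear interpolation are assumptions for illustration, not details taken from the report.

```python
# Hypothetical sketch of a progressive curriculum schedule. The report jointly
# schedules data composition, quality thresholds, and domain coverage across
# 20T tokens; the domains and weights here are illustrative assumptions.

def curriculum_mixture(progress: float) -> dict[str, float]:
    """Interpolate domain mixture weights by training progress in [0, 1].

    Early training favors broad web text; later training shifts toward
    synthetic bilingual, domain-specific, and RL-oriented data.
    """
    assert 0.0 <= progress <= 1.0, "progress must be a fraction of total tokens"
    start = {"web": 0.70, "code_math": 0.15, "synthetic_bilingual": 0.10, "rl_oriented": 0.05}
    end = {"web": 0.30, "code_math": 0.25, "synthetic_bilingual": 0.25, "rl_oriented": 0.20}
    # Linear interpolation between the early and late mixtures.
    mix = {d: (1 - progress) * start[d] + progress * end[d] for d in start}
    # Renormalize defensively so the weights always sum to 1.
    total = sum(mix.values())
    return {d: w / total for d, w in mix.items()}
```

In an actual run, a data loader would sample each training batch according to `curriculum_mixture(tokens_seen / total_tokens)`, so the mixture drifts smoothly rather than switching abruptly between phases.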
Problem

Research questions and friction points this paper is trying to address.

underserved languages
data scarcity
large language models
reasoning capabilities
bilingual modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
data synthesis
curriculum learning
SnapPO
underserved languages
👥 Authors

Sungrae Park · Upstage AI · Document AI, Large Language Models
Sanghoon Kim · Upstage Solar Team
Jungho Cho · Upstage Solar Team
Gyoungjin Gim · Upstage Solar Team
Dawoon Jung · Upstage Solar Team
Mikyoung Cha · Upstage Solar Team
Eunhae Choo · Upstage Solar Team
Taekgyu Hong · Upstage Solar Team
Minbyul Jeong · Upstage AI, Research Scientist, Ph.D.
SeHwan Joo · Upstage Solar Team
Minsoo Khang · Upstage AI · OCR, Intelligent Document Parsing
Eunwon Kim · Upstage Solar Team
Minjeong Kim · Associate Professor and Head, Dept. of Computer Science, University of North Carolina at Greensboro · Medical Image Analysis, Machine Learning, Deep Learning, Computational Neuroscience
Sujeong Kim · University of Maryland
Yunsu Kim · aiXplain, Inc. · Natural Language Processing, Machine Translation, Machine Learning
Hyeonju Lee · Upstage Solar Team
Seunghyun Lee · Upstage Solar Team
Sukyung Lee · Upstage Solar Team
Siyoung Park · Upstage Solar Team
Gyungin Shin · Upstage Solar Team
Inseo Song · Upstage Solar Team
Wonho Song · Upstage AI
Seonghoon Yang · Upstage Solar Team
Seungyoun Yi · Upstage Solar Team
Sanghoon Yoon · Upstage Solar Team
Jeonghyun Ko · Upstage Solar Team
Seyoung Song · Ph.D. Student at KAIST, School of Computing · NLP
Keunwoo Choi · talkpl.ai · Music Information Retrieval, Machine Learning, Language Models
Hwalsuk Lee · Upstage Solar Team
Sunghun Kim · Associate Professor of Computer Science, The Hong Kong University of Science and Technology · Software Engineering, Machine Learning, Deep Learning, Mining Software Repositories, Software Testing
Du-Seong Chang · Upstage Solar Team
Kyunghyun Cho · New York University, Genentech · Machine Learning, Deep Learning
Junsuk Choe · Associate Professor, Sogang University · Representation Learning, Language Modeling, Machine Unlearning
Hwaran Lee · Upstage Solar Team
Jae-Gil Lee · Upstage Solar Team
KyungTae Lim · École normale supérieure · Natural Language Processing
Alice Oh · KAIST Computer Science · machine learning, NLP, computational social science