Trillion 7B Technical Report

πŸ“… 2025-04-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of developing efficient Korean-centric multilingual large language models (LLMs) under resource constraints. We propose Trillion-7B, a 7B-parameter Korean-centric multilingual LLM optimized for token efficiency. Methodologically, we introduce Cross-lingual Document Attention (XLDA), integrated with language-aware data mixing, multilingual filtering, and a customized tokenizer, enabling efficient transfer of English knowledge while using only 2T training tokens, of which just 10% are multilingual (Korean, Japanese, Chinese). Experiments demonstrate state-of-the-art or highly competitive performance across 27 English, Korean, Japanese, and Chinese benchmarks, with significantly improved cross-lingual consistency. Full training requires only 59.4K H100 GPU-hours (about $148K), the highest token efficiency among existing Korean-centric multilingual LLMs.

πŸ“ Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.
Problem

Research questions and friction points this paper is trying to address.

Develop token-efficient Korean-centric multilingual LLM
Enable efficient English-to-Asian language knowledge transfer
Achieve robust performance with minimal multilingual training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual Document Attention for knowledge transfer
Optimized data mixtures and language filtering
Tailored tokenizer construction for efficiency
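This page describes XLDA only at a high level, so the sketch below is a hedged illustration of one plausible form a cross-lingual document attention mask could take during packed-sequence pretraining. The function name `xlda_mask`, the per-token `doc_ids` and `langs` inputs, and the choice of English as the hub language are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an XLDA-style attention mask (not the authors' code).
# Assumed setup: several documents are packed into one training sequence, and
# attention is normally blocked at document boundaries. The idea illustrated
# here is to additionally let tokens of a non-English document attend to an
# English document packed in the same sequence, so English knowledge can
# transfer to the target language during pretraining.
import numpy as np

def xlda_mask(doc_ids, langs, hub_lang="en"):
    """Build a (seq_len, seq_len) boolean mask; True means attention is allowed.

    doc_ids : per-token document index within the packed sequence
    langs   : per-token language tag
    """
    doc_ids = np.asarray(doc_ids)
    langs = np.asarray(langs)
    seq_len = len(doc_ids)

    # Causal mask: a token may only attend to itself and earlier positions.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

    # Standard document mask: attention stays within the same document.
    same_doc = doc_ids[:, None] == doc_ids[None, :]

    # XLDA relaxation (assumed form): non-hub-language tokens may also attend
    # to hub-language (e.g. English) tokens from other documents in the pack.
    cross_lingual = (langs[:, None] != hub_lang) & (langs[None, :] == hub_lang)

    return causal & (same_doc | cross_lingual)

# Tiny usage example: an English doc (tokens 0-3) followed by a Korean doc (4-7).
mask = xlda_mask(doc_ids=[0, 0, 0, 0, 1, 1, 1, 1],
                 langs=["en"] * 4 + ["ko"] * 4)
print(mask.astype(int))  # Korean tokens can see the English doc; not vice versa.
```

In this reading, the usual per-document attention mask is relaxed so that Korean, Japanese, or Chinese tokens can additionally attend to English tokens packed in the same sequence, which is one way to realize the English-to-target-language transfer described above.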
Authors
Sungjun Han
Trillion Labs
deep learning, natural language processing, meta-learning
Juyoung Suk
KAIST
Large Language Models
Suyeong An
Trillion Labs
Hyungguk Kim
Trillion Labs
Kyuseok Kim
Trillion Labs
Wonsuk Yang
Trillion Labs
Seungtaek Choi
Trillion Labs
Jamin Shin
Trillion Labs