AI Summary
This work addresses the challenge of developing efficient Korean-centric multilingual large language models (LLMs) under resource constraints. We propose Trillion-7B, a 7B-parameter Korean-centric multilingual LLM optimized for token efficiency. Methodologically, we introduce cross-lingual document attention (XLDA), integrated with language-aware data mixing, multilingual filtering, and a customized tokenizer, enabling efficient knowledge transfer from English using only 2T training tokens, of which just 10% are multilingual (Korean, Japanese, Chinese). Experiments demonstrate state-of-the-art or highly competitive performance across 27 English, Korean, Japanese, and Chinese benchmarks, with significantly improved cross-lingual consistency. Full training requires only 59.4K H100 GPU-hours (approximately $148K), yielding the highest token efficiency among existing Korean-centric multilingual LLMs.
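For scale, the quoted cost is consistent with a cloud rate of roughly $2.50 per H100 GPU-hour; the hourly rate in the sketch below is an illustrative assumption, not a figure from the report.

```python
# Back-of-the-envelope check of the quoted training cost.
# The hourly H100 rate is an assumed cloud price, not from the report.
gpu_hours = 59_400
assumed_rate_per_gpu_hour = 2.49  # USD per H100 GPU-hour, illustrative assumption

estimated_cost = gpu_hours * assumed_rate_per_gpu_hour
print(f"Estimated training cost: ${estimated_cost:,.0f}")  # ~$147,906, i.e. roughly $148K
```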
Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.
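To make the XLDA idea concrete, here is a minimal sketch of how a cross-lingual document attention mask might be built, assuming XLDA relaxes the usual per-document masking of packed training sequences so that target-language documents (e.g. Korean) can also attend to English documents packed in the same context. The function name `xlda_mask`, the pivot-language convention, and the exact masking rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def xlda_mask(doc_ids: torch.Tensor, lang_ids: torch.Tensor, pivot_lang: int = 0) -> torch.Tensor:
    """Sketch of an XLDA-style attention mask for one packed sequence.

    doc_ids:  [seq_len] document index of each token in the packed sequence
    lang_ids: [seq_len] language id of each token (0 = English pivot, 1 = Korean, ...)
    Returns a boolean [seq_len, seq_len] mask where True means attention is allowed.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.ones(seq_len, seq_len).tril().bool()

    # Standard packed-document masking: a token attends only within its own document.
    same_doc = doc_ids[:, None] == doc_ids[None, :]

    # XLDA-style relaxation (assumed): a token in a non-pivot document may also
    # attend to tokens of pivot-language (English) documents in the same sequence.
    query_is_target = lang_ids[:, None] != pivot_lang
    key_is_pivot = lang_ids[None, :] == pivot_lang
    cross_lingual = query_is_target & key_is_pivot

    return causal & (same_doc | cross_lingual)

# Example: an English document (tokens 0-3) packed before a Korean document (tokens 4-7).
doc_ids  = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
lang_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = English (pivot), 1 = Korean
mask = xlda_mask(doc_ids, lang_ids, pivot_lang=0)
# Rows 4-7 (Korean queries) are True over columns 0-3 (English keys), whereas a plain
# per-document mask would have blocked that cross-document attention.
```

Under these assumptions, the only change relative to ordinary document-masked packing is the extra `cross_lingual` term, which is what would let Korean, Japanese, or Chinese documents condition on co-packed English text and thereby transfer knowledge from the English-dominant training data.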