Trillion 7B Technical Report

📅 2025-04-21

🤖 AI Summary
This work addresses the challenge of developing efficient Korean-centric multilingual large language models (LLMs) under resource constraints. We propose Trillion-7B, a 7B-scale, Korean-centric multilingual LLM optimized for token efficiency. Methodologically, we introduce Cross-lingual Document Attention (XLDA), combined with language-aware data mixing, multilingual filtering, and a customized tokenizer, to transfer knowledge efficiently from English using only 2T training tokens, of which just 10% are multilingual (Korean, Japanese, Chinese). Experiments show competitive performance across 27 benchmarks in English, Korean, Japanese, and Chinese, with strong cross-lingual consistency. Full training requires only 59.4K H100 GPU hours (≈$148K), which the authors describe as the highest token efficiency among existing Korean-centric multilingual LLMs.

📝 Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.
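
The budget figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (2T tokens, a 10% multilingual share, 59.4K H100 GPU hours, $148K). The derived values (hourly rate, tokens per GPU hour) and all variable names are illustrative assumptions, not figures reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_TOKENS = 2_000_000_000_000   # 2T training tokens
MULTILINGUAL_PERCENT = 10          # share of tokens that is multilingual
GPU_HOURS = 59_400                 # 59.4K H100 GPU hours
TOTAL_COST_USD = 148_000           # $148K for full training

# Token split implied by the 10% multilingual share.
# Integer arithmetic avoids floating-point rounding on large counts.
multilingual_tokens = TOTAL_TOKENS * MULTILINGUAL_PERCENT // 100
remaining_tokens = TOTAL_TOKENS - multilingual_tokens

# Derived quantities (not stated in the source; implied by the figures above).
rate = TOTAL_COST_USD / GPU_HOURS             # implied USD per H100 GPU hour
tokens_per_gpu_hour = TOTAL_TOKENS / GPU_HOURS

print(f"Multilingual tokens: {multilingual_tokens / 1e9:.0f}B")
print(f"Remaining tokens:    {remaining_tokens / 1e12:.1f}T")
print(f"Implied rate:        ${rate:.2f} per GPU hour")
print(f"Throughput:          {tokens_per_gpu_hour / 1e6:.1f}M tokens per GPU hour")
```

Under these figures, the multilingual portion works out to 200B tokens, and the quoted cost implies a rate of roughly $2.49 per H100 GPU hour.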
Problem

Research questions and friction points this paper is trying to address.

Develop a token-efficient, Korean-centric multilingual LLM
Enable efficient knowledge transfer from English to Asian languages
Achieve robust performance with minimal multilingual training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual Document Attention for knowledge transfer
Optimized data mixtures and language filtering
Tailored tokenizer construction for efficiency