AI Summary
This work addresses the challenge of developing efficient Korean-centric multilingual large language models (LLMs) under resource constraints. We propose Trillion-7B, a 7B-parameter Korean-centric multilingual LLM optimized for token efficiency. Methodologically, we introduce cross-lingual document attention (XLDA), integrated with language-aware data mixing, multilingual filtering, and a customized tokenizer, enabling efficient knowledge transfer from English using only 2T training tokens, of which just 10% are multilingual (Korean, Japanese, Chinese). Experiments demonstrate state-of-the-art or highly competitive performance across 27 English, Korean, Japanese, and Chinese benchmarks, with significantly improved cross-lingual consistency. Full training requires only 59.4K H100 GPU-hours (approximately $148K), yielding the highest token efficiency among existing Korean-centric multilingual LLMs.
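For scale, the quoted cost is consistent with a cloud rate of roughly $2.50 per H100 GPU-hour; the hourly rate in the sketch below is an illustrative assumption, not a figure from the report.

```python
# Back-of-the-envelope check of the quoted training cost.
# The hourly H100 rate is an assumed cloud price, not from the report.
gpu_hours = 59_400
assumed_rate_per_gpu_hour = 2.49  # USD per H100 GPU-hour, illustrative assumption

estimated_cost = gpu_hours * assumed_rate_per_gpu_hour
print(f"Estimated training cost: ${estimated_cost:,.0f}")  # ~$147,906, i.e. roughly $148K
```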
Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.
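To make the XLDA idea concrete, here is a minimal sketch of how a cross-lingual document attention mask might be built, assuming XLDA relaxes the usual per-document masking of packed training sequences so that target-language documents (e.g. Korean) can also attend to English documents packed in the same context. The function name `xlda_mask`, the pivot-language convention, and the exact masking rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def xlda_mask(doc_ids: torch.Tensor, lang_ids: torch.Tensor, pivot_lang: int = 0) -> torch.Tensor:
    """Sketch of an XLDA-style attention mask for one packed sequence.

    doc_ids:  [seq_len] document index of each token in the packed sequence
    lang_ids: [seq_len] language id of each token (0 = English pivot, 1 = Korean, ...)
    Returns a boolean [seq_len, seq_len] mask where True means attention is allowed.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.ones(seq_len, seq_len).tril().bool()

    # Standard packed-document masking: a token attends only within its own document.
    same_doc = doc_ids[:, None] == doc_ids[None, :]

    # XLDA-style relaxation (assumed): a token in a non-pivot document may also
    # attend to tokens of pivot-language (English) documents in the same sequence.
    query_is_target = lang_ids[:, None] != pivot_lang
    key_is_pivot = lang_ids[None, :] == pivot_lang
    cross_lingual = query_is_target & key_is_pivot

    return causal & (same_doc | cross_lingual)

# Example: an English document (tokens 0-3) packed before a Korean document (tokens 4-7).
doc_ids  = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
lang_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = English (pivot), 1 = Korean
mask = xlda_mask(doc_ids, lang_ids, pivot_lang=0)
# Rows 4-7 (Korean queries) are True over columns 0-3 (English keys), whereas a plain
# per-document mask would have blocked that cross-document attention.
```

Under these assumptions, the only change relative to ordinary document-masked packing is the extra `cross_lingual` term, which is what would let Korean, Japanese, or Chinese documents condition on co-packed English text and thereby transfer knowledge from the English-dominant training data.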