Mi:dm 2.0: Korea-centric Bilingual Language Models

📅 2026-01-14

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of current large language models in handling Korean, which stem from insufficient or low-quality Korean training data and a lack of cultural alignment, and which hinder their ability to capture Korea-specific values, commonsense knowledge, and nuanced emotional expressions. To overcome these challenges, the authors propose Mi:dm 2.0, a bilingual large language model that integrates Korean sociocultural commonsense and reasoning patterns. It combines proprietary data cleansing, synthetic data generation, a data mixing strategy guided by curriculum learning, and a Korean-optimized tokenizer. Released under the MIT License in Base (11.5B parameters) and Mini (2.3B parameters) variants, the model reports top-tier zero-shot results on Korean benchmarks such as KMMLU and is intended to support the broader K-intelligence ecosystem.

📝 Abstract

We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.
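
The abstract names depth-up scaling as the strategy behind Mi:dm 2.0 Base but does not describe the recipe. The sketch below illustrates one commonly described form of the technique, assuming the layer stack is copied twice, the last `overlap` layers are dropped from the first copy, the first `overlap` layers are dropped from the second copy, and the two are concatenated. The function name `depth_up_scale`, its parameters, and the layer counts are hypothetical and are not taken from the paper.

```python
from typing import List


def depth_up_scale(num_layers: int, overlap: int) -> List[int]:
    """Return, for each layer of the deepened model, the index of the
    base-model layer it would be initialized from.

    Assumed recipe (not confirmed by the abstract): keep layers
    [0, num_layers - overlap) from one copy of the base model and
    layers [overlap, num_layers) from a second copy, then stack them.
    """
    if not 0 <= overlap < num_layers:
        raise ValueError("overlap must satisfy 0 <= overlap < num_layers")
    lower = list(range(0, num_layers - overlap))  # first copy, tail removed
    upper = list(range(overlap, num_layers))      # second copy, head removed
    return lower + upper


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    plan = depth_up_scale(num_layers=32, overlap=8)
    print(len(plan))  # 48
```

Under this assumption the deepened model has `2 * (num_layers - overlap)` layers, and the middle layers of the base model appear twice in the result. A scaled model of this kind usually needs continued pre-training to recover from the discontinuity where the two copies meet.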

Problem

Research questions and friction points this paper is trying to address.

Korea-centric AI

bilingual language models

cultural alignment

low-quality Korean data

commonsense knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

Korea-centric LLM

cultural alignment

high-quality synthetic data