Llamion Technical Report

📅 2026-05-25

🤖 AI Summary

This work addresses the problem of converting a large language model from a non-Llama architecture to the standard Llama architecture while preserving its performance. The authors propose KEPT, a recipe that combines Normal Parameter Mapping (NPM) for unchanged modules, training-free Optimized Parameter Mapping (OPM) for LayerNorm-to-RMSNorm initialization, and Cross-architecture Knowledge Distillation (XKD) to align the converted model's outputs with the source model's. Using about 123 million tokens and four days of training on a single A100 GPU, they convert Orion-14B into the Llamion series. Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry on the Open Ko LLM Leaderboard by more than 7 absolute points at submission time. Capabilities absent from the transfer corpus (Python programming and 200K-token context handling) survive the conversion.

📝 Abstract

We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
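The abstract does not spell out the OPM mapping itself, but the reason a near-zero-mean regime matters can be illustrated with a small numerical check. The sketch below is not the paper's method: it assumes the simplest possible initialization (copy the LayerNorm gain into RMSNorm and ignore the LayerNorm bias), and the helper names `layer_norm`, `rms_norm`, and `max_abs_gap` are hypothetical. When the input has zero mean, the variance equals the mean square, so the two normalizations coincide; when the mean is far from zero, they diverge.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    # Standard LayerNorm over one vector: center, scale by the std, apply gain and bias.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    # RMSNorm: no centering and no bias, scale by the root mean square.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * g for v, g in zip(x, gain)]

def max_abs_gap(x, gain):
    # Assumption for illustration: the LayerNorm bias is zero and the gain is copied as-is.
    zero_bias = [0.0] * len(x)
    ln = layer_norm(x, gain, zero_bias)
    rn = rms_norm(x, gain)
    return max(abs(a - b) for a, b in zip(ln, rn))

gain = [1.0, 0.5, 2.0, 1.5]
centered = [0.3, -1.2, 0.5, 0.4]        # elements sum to zero
shifted = [v + 3.0 for v in centered]   # same shape, mean moved to 3.0

gap_centered = max_abs_gap(centered, gain)
gap_shifted = max_abs_gap(shifted, gain)
print(f"zero-mean input gap:    {gap_centered:.2e}")
print(f"shifted-mean input gap: {gap_shifted:.2e}")
```

The first gap is at floating-point noise level, while the second is large, which is consistent with the abstract's framing that the quality of a training-free LayerNorm-to-RMSNorm initialization depends on activations being close to zero-mean.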

Problem

Research questions and friction points this paper is trying to address.

architecture transformation

knowledge preservation

language model conversion

cross-architecture distillation

parameter mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

KEPT

Parameter Mapping

Cross-architecture Knowledge Distillation

RMSNorm Initialization

Model Transformation
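As a rough illustration of cross-architecture distillation with a frozen teacher, the sketch below computes a per-position KL divergence between teacher and student next-token distributions. The abstract only says that XKD aligns the converted model's outputs with the source model's; the KL objective, the `temperature` parameter, and the function names here are assumptions for illustration, not the loss the authors report.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions over the same vocabulary.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    # Mean KL over token positions; the teacher is treated as a fixed target
    # (in a real training loop no gradient would flow into it).
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t, temperature), softmax(s, temperature))
    return total / len(teacher_logits)

# Toy logits: two positions, vocabulary of three tokens (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
aligned = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]   # student matches the teacher
drifted = [[0.0, 0.0, 0.0], [3.0, 0.2, 0.1]]    # student disagrees

loss_aligned = distillation_loss(teacher, aligned)
loss_drifted = distillation_loss(teacher, drifted)
print(loss_aligned, loss_drifted)
```

A loss of this kind needs no labels, only inputs to run through both models, which fits the abstract's statement that alignment can be done on any reasonable input distribution.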
