Llamion Technical Report

📅 2026-05-25
🤖 AI Summary
This work addresses the challenge of efficiently converting large language models from non-Llama architectures to the standard Llama architecture while preserving performance. The authors propose KEPT, a method that combines training-free Optimized Parameter Mapping (OPM) for the LayerNorm-to-RMSNorm initialization, Normal Parameter Mapping (NPM) for unchanged modules, and cross-architecture Knowledge Distillation (XKD) from a frozen teacher of equal size, aligning the converted model's behavior with the source model's at low computational cost. Using about 123 million tokens and four days of training on a single A100 GPU, they convert Orion-14B into the Llamion series. The resulting Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry on the Open Ko LLM Leaderboard by more than 7 absolute points at submission time. Python programming ability and 200K-token context handling, neither of which is represented in the transfer corpus, are reported to survive the conversion intact, indicating that the architecture can be changed without extensive retraining.
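As an illustration of what a frozen-teacher distillation objective can look like, the sketch below computes a token-averaged KL divergence between teacher and student next-token distributions. It is a minimal sketch only: the summary does not state the exact XKD loss, so the KL form, the function names (log_softmax, kl_distill_loss), and the toy logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student), averaged over token positions.

    Assumption: this KL objective stands in for the XKD loss, whose
    exact form is not given in the text above.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # sum_i p_teacher(i) * (log p_teacher(i) - log p_student(i))
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy logits: two token positions, vocabulary of three (hypothetical values).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_same = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_off = [[0.0, 0.0, 0.0], [3.0, 0.2, 0.1]]

loss_same = kl_distill_loss(teacher, student_same)  # student matches teacher
loss_off = kl_distill_loss(teacher, student_off)    # student disagrees
print(loss_same, loss_off)
```

In a real setup the teacher logits would come from the frozen source model and only the student's parameters would receive gradients; the loss is zero exactly when the two distributions agree at every position.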
📝 Abstract
We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
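The role of the near-zero-mean regime can be seen in a small numerical check. When the activations entering a normalization layer have zero mean, the variance equals the mean square, so a bias-free LayerNorm and an RMSNorm with the same gain produce the same output. The sketch below only illustrates that identity; it is not the paper's OPM derivation, and the function names, the zero LayerNorm bias, the dimension, and the random inputs are all assumptions chosen for the example.

```python
import math
import random


def layer_norm(x, gain, bias, eps=1e-6):
    """Standard LayerNorm over one vector: center, scale, apply gain and bias."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain, eps=1e-6):
    """Standard RMSNorm over one vector: scale by root mean square, apply gain."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gain)]


def max_gap(x, gain):
    """Largest elementwise difference between LayerNorm and RMSNorm outputs.

    Assumption: the LayerNorm bias is zero, since RMSNorm has no bias term.
    """
    bias = [0.0] * len(x)
    ln = layer_norm(x, gain, bias)
    rn = rms_norm(x, gain)
    return max(abs(a - b) for a, b in zip(ln, rn))


random.seed(0)
dim = 64  # hypothetical hidden size for the demo
gain = [random.uniform(0.5, 1.5) for _ in range(dim)]
raw = [random.gauss(0.0, 1.0) for _ in range(dim)]
mean = sum(raw) / dim

centered = [v - mean for v in raw]       # zero-mean activations
shifted = [v + 2.0 for v in centered]    # activations with mean 2.0

gap_centered = max_gap(centered, gain)   # the two norms agree
gap_shifted = max_gap(shifted, gain)     # the two norms diverge
print(gap_centered, gap_shifted)
```

With zero-mean inputs, copying the LayerNorm gain into RMSNorm already reproduces the layer's output up to floating-point error, whereas a nonzero mean leaves a visible gap that a mapping or subsequent training would have to absorb.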
Problem

Research questions and friction points this paper is trying to address.

architecture transformation
knowledge preservation
language model conversion
cross-architecture distillation
parameter mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

KEPT
Parameter Mapping
Cross-architecture Knowledge Distillation
RMSNorm Initialization
Model Transformation