🤖 AI Summary
This work addresses the high cost of training small language models from scratch and the limited scalability of conventional knowledge distillation, which must query the large teacher model again for every new student size. The authors propose Chain-based Distillation (CBD), which builds a sparse chain of intermediate anchor models through stepwise distillation so that knowledge from a large language model can be transferred to smaller models of variable size. CBD pairs this distillation chain with bridge distillation for cross-architecture and cross-vocabulary transfer, and initializes a target model by interpolating parameters between adjacent anchors, without re-invoking the large teacher. Experiments show that a 138M-parameter model initialized via CBD, without recovery pre-training, outperforms models trained from scratch on a 10B-token corpus, and that CBD improves downstream performance and training efficiency, including in settings with heterogeneous architectures and vocabularies.
📝 Abstract
Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To address these problems, we propose \textbf{Chain-based Distillation (CBD)}, a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce \emph{bridge distillation} for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated inference with the large teacher. Experiments show that the proposed method substantially improves efficiency and downstream performance. A 138M-parameter SLM initialized with CBD, without recovery pre-training, outperforms models trained from scratch on a 10B-token corpus on the evaluated task. CBD is also versatile in heterogeneous settings, initializing models with different architectures and vocabularies.
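The abstract does not spell out how interpolation between adjacent anchors is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the anchors are ordered by parameter count, that the interpolation weight is linear in model size, and that the two anchors' parameters have already been brought to a common shape (a step the abstract leaves unspecified). The names `bracket_anchors` and `interpolate`, and the anchor sizes in the usage note, are hypothetical.

```python
from bisect import bisect_left


def bracket_anchors(anchor_sizes, target_size):
    """Find the adjacent anchors around target_size.

    Returns (lower_size, upper_size, alpha), where alpha in [0, 1] is the
    linear position of target_size between the two anchors.
    Assumption: the weight is linear in parameter count.
    """
    sizes = sorted(anchor_sizes)
    if not sizes[0] <= target_size <= sizes[-1]:
        raise ValueError("target size lies outside the anchor chain")
    upper = bisect_left(sizes, target_size)
    if sizes[upper] == target_size:
        # The target coincides with an anchor; no interpolation is needed.
        return sizes[upper], sizes[upper], 0.0
    lower = upper - 1
    alpha = (target_size - sizes[lower]) / (sizes[upper] - sizes[lower])
    return sizes[lower], sizes[upper], alpha


def interpolate(lower_params, upper_params, alpha):
    """Elementwise linear interpolation of two equally long flat parameter lists.

    Assumption: both anchors were already mapped to a common shape.
    """
    if len(lower_params) != len(upper_params):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(lower_params, upper_params)]
```

With hypothetical anchors of 70M, 160M, and 410M parameters, `bracket_anchors([70, 160, 410], 138)` selects the 70M and 160M anchors with a weight of 68/90 toward the larger one, and `interpolate` then mixes their parameters with that weight. No call to the large teacher appears in this step, which matches the scalability argument made in the abstract.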