Chain-of-Model Learning for Language Model

📅 2025-05-17

🤖 AI Summary

To address limited scalability and inflexible deployment in large language model (LLM) training and inference, this paper proposes the Chain-of-Model (CoM) learning paradigm. CoM embeds a causal structure into the hidden states of each Transformer layer: the hidden state is split along the hidden dimension into sub-representations (chains), called a Chain-of-Representation (CoR), and each output chain depends only on its preceding input chains. Applying this idea to every Transformer layer gives the Chain-of-Language-Model (CoLM), which can be scaled up progressively by adding chains to an already trained model and can serve sub-models of different sizes by using different numbers of chains. A variant, CoLM-Air, adds a KV sharing mechanism that computes all keys and values within the first chain and shares them across all chains, which enables seamless LM switching and prefilling acceleration. Experiments reported in the paper show the CoLM family achieving performance comparable to the standard Transformer while offering progressive scaling for training efficiency and multiple model sizes for elastic inference.

📝 Abstract

In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon the CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air by adding a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
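The chain constraint described in the abstract can be illustrated with a masked linear layer. The following is a minimal sketch, not the authors' implementation: it assumes that output chain i may read input chains 1 through i (inclusive), which is one reading of "preceding chains", and the names `chain_mask`, `chain_linear`, and `sub_model_output` are hypothetical. It uses plain Python lists so that it runs without extra dependencies.

```python
import random

def chain_mask(in_sizes, out_sizes):
    """Build a block lower-triangular 0/1 mask.

    Assumption: output chain i may read input chains 0..i (inclusive).
    Returns a list of rows with shape sum(out_sizes) x sum(in_sizes).
    """
    assert len(in_sizes) == len(out_sizes)
    mask = []
    for i, out_width in enumerate(out_sizes):
        row = []
        for j, in_width in enumerate(in_sizes):
            row.extend([1 if j <= i else 0] * in_width)
        # Every output unit inside chain i shares the same visibility pattern.
        mask.extend([list(row) for _ in range(out_width)])
    return mask

def chain_linear(x, weight, mask):
    """Compute y = (weight * mask) @ x for a single input vector x."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weight, mask)
    ]

def sub_model_output(x, weight, mask, in_sizes, out_sizes, n_chains):
    """Run only the first n_chains chains by truncating inputs and weights."""
    d_in = sum(in_sizes[:n_chains])
    d_out = sum(out_sizes[:n_chains])
    sub_weight = [row[:d_in] for row in weight[:d_out]]
    sub_mask = [row[:d_in] for row in mask[:d_out]]
    return chain_linear(x[:d_in], sub_weight, sub_mask)

# Hypothetical toy configuration: two chains of width 2 each.
random.seed(0)
in_sizes = [2, 2]
out_sizes = [2, 2]
mask = chain_mask(in_sizes, out_sizes)
weight = [[random.uniform(-1, 1) for _ in range(sum(in_sizes))]
          for _ in range(sum(out_sizes))]
x = [random.uniform(-1, 1) for _ in range(sum(in_sizes))]

full = chain_linear(x, weight, mask)
small = sub_model_output(x, weight, mask, in_sizes, out_sizes, n_chains=1)
# The first chain of the full output matches the one-chain sub-model output.
print(full[:len(small)], small)
```

Because the mask zeroes every weight that would let an earlier chain read a later one, truncating the layer to its first chains leaves their outputs unchanged. Under the stated assumption, this is the property that allows a smaller sub-model to be extracted for elastic inference and new chains to be appended for progressive scaling.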

Problem

Research questions and friction points this paper is trying to address.

Improves model training efficiency via Chain-of-Model paradigm

Enables flexible inference with multiple sub-model sizes

Enhances scalability and extensibility in Transformer architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Model (CoM) integrates causal relationships into hidden states

Chain-of-Representation (CoR) combines sub-representations at hidden dimension level

CoLM-Air computes all keys and values within the first chain and shares them across all chains, enabling seamless LM switching and prefilling acceleration
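The KV sharing idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the CoLM-Air implementation: `shared_kv`, `matvec`, the projection matrices, and the sizes are hypothetical, and the example assumes the first chain of a larger model's hidden state equals the hidden state of the one-chain sub-model, as the chain constraint implies.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def shared_kv(hidden, w_k, w_v, first_chain_width):
    """Compute one token's key and value from the first chain only.

    All later chains are ignored here, so every sub-model size
    produces (and can reuse) the same key/value pair.
    """
    first_chain = hidden[:first_chain_width]
    return matvec(w_k, first_chain), matvec(w_v, first_chain)

# Hypothetical toy projections for a first chain of width 2.
first_chain_width = 2
w_k = [[0.5, -1.0], [2.0, 0.25]]
w_v = [[1.0, 0.0], [0.0, 1.0]]

small_hidden = [1.0, 2.0]             # sub-model using one chain
large_hidden = [1.0, 2.0, 7.0, -3.0]  # two chains, same first chain

kv_small = shared_kv(small_hidden, w_k, w_v, first_chain_width)
kv_large = shared_kv(large_hidden, w_k, w_v, first_chain_width)
print(kv_small == kv_large)
```

Since the keys and values never depend on chains beyond the first, a cache filled by a small sub-model stays valid for a larger one. That is one plausible reading of why the abstract links KV sharing to LM switching and prefilling acceleration.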

🔎 Similar Papers

Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling