🤖 AI Summary
This work addresses the challenge of balancing generation quality and inference efficiency in pretrained decoder-only large language models (e.g., Gemma 2). The authors study a new problem: efficiently adapting pretrained decoder-only LLMs into encoder-decoder architectures, including heterogeneous-scale pairings (e.g., a Gemma 9B encoder with a Gemma 2B decoder). They systematically explore pretraining objectives, parameter initialization, and optimization strategies, then evaluate via instruction tuning and finetuning. Results show: (i) after instruction tuning, the adapted Gemma 2B-2B outperforms the Gemma 2B baseline by ~7%; (ii) the Gemma 9B-2B variant further surpasses Gemma 2B-2B by >3%; and (iii) the adapted encoder's representations yield better SuperGLUE scores than those of comparably sized decoder-only models. The authors present this as a novel, computationally efficient cross-architecture adaptation paradigm: it inherits the capabilities of the decoder-only LLM at far lower cost than pretraining an encoder-decoder model from scratch.
📝 Abstract
While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representations. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterparts. For example, Gemma 2B-2B outperforms Gemma 2B by $\sim$7% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by $>$3%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.
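To make the adaptation idea concrete, below is a minimal, hypothetical sketch of the warm-start step: self-attention and feed-forward weights from a decoder-only checkpoint seed both the encoder and decoder stacks, while the decoder's cross-attention (which has no counterpart in the source model) is freshly initialized. The checkpoint layout, layer names, and toy dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # toy hidden size; the paper's models are 2B/9B scale
NUM_LAYERS = 2  # toy depth

# Hypothetical decoder-only checkpoint: per-layer self-attention + FFN weights.
decoder_only = {
    f"layer{i}.{name}": rng.standard_normal((D, D))
    for i in range(NUM_LAYERS)
    for name in ("self_attn", "ffn")
}

def adapt_to_encoder_decoder(ckpt, rng):
    """Warm-start an encoder-decoder from a decoder-only checkpoint (sketch).

    Shared weights (self-attention, FFN) are copied into both stacks;
    cross-attention, absent from the source model, is randomly initialized.
    """
    enc = {f"encoder.{k}": v.copy() for k, v in ckpt.items()}
    dec = {f"decoder.{k}": v.copy() for k, v in ckpt.items()}
    for i in range(NUM_LAYERS):
        # No pretrained counterpart exists for cross-attention: fresh init.
        dec[f"decoder.layer{i}.cross_attn"] = rng.standard_normal((D, D))
    return {**enc, **dec}

model = adapt_to_encoder_decoder(decoder_only, rng)
```

Heterogeneous pairings like Gemma 9B-2B would follow the same pattern with two source checkpoints, one seeding the encoder and a smaller one seeding the decoder.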