🤖 AI Summary
This work addresses the challenge of balancing generation quality and inference efficiency in pretrained decoder-only large language models (e.g., Gemma 2). The authors study a new problem: efficiently adapting pretrained decoder-only LLMs into encoder-decoder architectures, including heterogeneous-scale pairings (e.g., a Gemma 9B encoder with a Gemma 2B decoder). They systematically explore pretraining objectives, parameter initialization, and optimization strategies, then evaluate via instruction tuning and finetuning. Results show: (i) after instruction tuning, the adapted Gemma 2B-2B outperforms the Gemma 2B baseline by ~7%; (ii) the Gemma 9B-2B variant further surpasses Gemma 2B-2B by >3%; and (iii) the adapted encoder's representations yield better SuperGLUE scores than those of comparably sized decoder-only models. The authors present this as a novel, computationally efficient cross-architecture adaptation paradigm: it inherits the capabilities of the decoder-only LLM at far lower cost than pretraining an encoder-decoder model from scratch.
📝 Abstract
While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representations. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterparts. For example, Gemma 2B-2B outperforms Gemma 2B by $\sim$7% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by $>$3%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.
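To make the adaptation idea concrete, below is a minimal, hypothetical sketch of the warm-start step: self-attention and feed-forward weights from a decoder-only checkpoint seed both the encoder and decoder stacks, while the decoder's cross-attention (which has no counterpart in the source model) is freshly initialized. The checkpoint layout, layer names, and toy dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # toy hidden size; the paper's models are 2B/9B scale
NUM_LAYERS = 2  # toy depth

# Hypothetical decoder-only checkpoint: per-layer self-attention + FFN weights.
decoder_only = {
    f"layer{i}.{name}": rng.standard_normal((D, D))
    for i in range(NUM_LAYERS)
    for name in ("self_attn", "ffn")
}

def adapt_to_encoder_decoder(ckpt, rng):
    """Warm-start an encoder-decoder from a decoder-only checkpoint (sketch).

    Shared weights (self-attention, FFN) are copied into both stacks;
    cross-attention, absent from the source model, is randomly initialized.
    """
    enc = {f"encoder.{k}": v.copy() for k, v in ckpt.items()}
    dec = {f"decoder.{k}": v.copy() for k, v in ckpt.items()}
    for i in range(NUM_LAYERS):
        # No pretrained counterpart exists for cross-attention: fresh init.
        dec[f"decoder.layer{i}.cross_attn"] = rng.standard_normal((D, D))
    return {**enc, **dec}

model = adapt_to_encoder_decoder(decoder_only, rng)
```

Heterogeneous pairings like Gemma 9B-2B would follow the same pattern with two source checkpoints, one seeding the encoder and a smaller one seeding the decoder.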