Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Decoder-only large language models (e.g., Gemma) underperform on encoder-style downstream tasks such as classification, regression, and ranking. To address this, we propose the first systematic decoder-to-encoder adaptation framework. Our approach replaces Gemma's causal attention mask with bidirectional attention, introduces multiple sequence-pooling strategies, refines dropout scheduling and hyperparameters, and performs end-to-end fine-tuning on GLUE and MS MARCO. Experimental results demonstrate that the adapted model significantly outperforms standard encoder baselines, including BERT and RoBERTa, across multiple GLUE tasks and the MS MARCO passage-ranking benchmark. This validates that decoder-only architectures, when structurally reconfigured, can serve as highly effective general-purpose encoders. Our work establishes a paradigm for large-model architecture generalization, bridging the functional gap between decoder-centric and encoder-centric designs.
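The core architectural change described above, swapping the decoder's causal attention mask for full bidirectional attention, can be illustrated with a minimal NumPy sketch. This is not the paper's actual implementation (Gemma uses multi-head attention with many more components); it only shows how the mask determines which positions each token can attend to.

```python
import numpy as np

def attention(q, k, v, mask):
    """Scaled dot-product attention; positions where mask is False are ignored."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)          # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over allowed positions
    return weights @ v

seq_len, d = 4, 8
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((seq_len, d))

# Decoder (causal): token i attends only to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# Encoder (bidirectional): every token attends to every token.
full_mask = np.ones((seq_len, seq_len), dtype=bool)

causal_out = attention(q, k, v, causal_mask)
bidir_out = attention(q, k, v, full_mask)
```

Note that the last token's output is identical under both masks (it already sees the whole sequence), while earlier tokens gain access to future context only in the bidirectional case.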

📝 Abstract
Decoder-based transformers, while revolutionizing language modeling and scaling to immense sizes, have not completely overtaken encoder-heavy architectures in natural language processing. Specifically, encoder-only models remain dominant in tasks like classification, regression, and ranking. This is primarily due to the inherent structure of decoder-based models, which limits their direct applicability to these tasks. In this paper, we introduce Gemma Encoder, adapting the powerful Gemma decoder model to an encoder architecture, thereby unlocking its potential for a wider range of non-generative applications. To optimize the adaptation from decoder to encoder, we systematically analyze various pooling strategies, attention mechanisms, and hyperparameters (e.g., dropout rate). Furthermore, we benchmark Gemma Encoder against established approaches on the GLUE benchmark and the MS MARCO ranking benchmark, demonstrating its effectiveness and versatility.
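The pooling strategies the abstract mentions reduce a sequence of per-token hidden states to a single vector for classification or ranking. The following sketch (hypothetical names; the paper's exact strategy set and implementation are not given here) shows three common choices: first-token (CLS-style), last non-padding token (natural for decoders), and mean over non-padding tokens.

```python
import numpy as np

def pool(hidden, strategy, attn_mask):
    """Collapse per-token hidden states (seq_len, d) into one vector (d,)."""
    valid = attn_mask.astype(bool)
    if strategy == "first":   # CLS-style: take the first token
        return hidden[0]
    if strategy == "last":    # last non-padding token, as in decoder LMs
        return hidden[valid][-1]
    if strategy == "mean":    # average over non-padding tokens
        return hidden[valid].mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

hidden = np.arange(12, dtype=float).reshape(4, 3)  # 4 tokens, hidden dim 3
mask = np.array([1, 1, 1, 0])                      # final token is padding

first_vec = pool(hidden, "first", mask)  # row 0
last_vec = pool(hidden, "last", mask)    # row 2 (last real token)
mean_vec = pool(hidden, "mean", mask)    # average of rows 0..2
```

Masking out padding before pooling matters in practice: including padding rows in the mean or taking a padding token as "last" would corrupt the sequence representation.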
Problem

Research questions and friction points this paper is trying to address.

Adapting decoder-based models for encoder tasks
Optimizing decoder-to-encoder adaptation strategies
Benchmarking Gemma Encoder on GLUE and MS MARCO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts Gemma decoder to encoder architecture
Analyzes pooling, attention, and hyperparameters
Benchmarks on GLUE and MS MARCO benchmarks