🤖 AI Summary
This work addresses the limited applicability of encoder-only models (e.g., BERT, ModernBERT), which conventionally rely on task-specific classification heads, compared with decoder-based large language models (LLMs). Methodologically, it reuses the standard masked language modeling (MLM) head, combined with lightweight instruction tuning, to perform zero-shot and fine-tuned generative classification directly via the [MASK] token. Key contributions are threefold: (1) empirical evidence that the MLM head can substitute for conventional classification heads; (2) identification of contemporary, diverse pretraining data as a critical prerequisite for this capability, with models trained on lower-volume, less diverse data performing considerably worse; and (3) demonstration that ModernBERT-Large-Instruct (0.4B parameters) achieves 93% of Llama3-1B’s MMLU score with 60% fewer parameters, outperforms similarly sized LLMs on MMLU in the zero-shot setting, and matches or surpasses traditional classification-head methods after fine-tuning.
📝 Abstract
While encoder-only models such as BERT and ModernBERT are ubiquitous in real-world NLP applications, their conventional reliance on task-specific classification heads can limit their applicability compared to decoder-based large language models (LLMs). In this work, we introduce ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that leverages its masked language modelling (MLM) head for generative classification. Our approach employs an intentionally simple training loop and inference mechanism that requires no heavy pre-processing, heavily engineered prompting, or architectural modifications. ModernBERT-Large-Instruct exhibits strong zero-shot performance on both classification and knowledge-based tasks, outperforming similarly sized LLMs on MMLU and achieving 93% of Llama3-1B's MMLU performance with 60% fewer parameters. We also demonstrate that, when fine-tuned, the generative approach using the MLM head matches or even surpasses traditional classification-head methods across diverse NLU tasks. This capability emerges specifically in models trained on contemporary, diverse data mixes, with models trained on lower-volume, less diverse data yielding considerably weaker performance. Although preliminary, these results demonstrate the potential of using the original generative masked language modelling head over traditional task-specific heads for downstream tasks. Our work suggests that further exploration into this area is warranted, highlighting many avenues for future improvements.
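
To make the idea of generative classification through the MLM head concrete, the following is a minimal sketch, not the authors' implementation. It assumes the prompt ends in a single [MASK] token, that each answer option corresponds to exactly one vocabulary token, and that the prediction is taken over those candidate tokens only; the abstract does not specify these details. The vocabulary, logits, and the function name `classify_with_mask_logits` are hypothetical toy values standing in for the output of a real MLM head.

```python
import math


def classify_with_mask_logits(mask_logits, label_token_ids):
    """Pick the label whose token scores highest at the [MASK] position.

    mask_logits: list of floats, one score per vocabulary token, as an MLM
        head would produce for the [MASK] position (toy values here).
    label_token_ids: dict mapping each label to its single token id
        (an assumption: every label is one token).
    """
    # Keep only the scores of the candidate label tokens.
    scores = {label: mask_logits[tok] for label, tok in label_token_ids.items()}

    # Softmax over the candidates only, shifted by the max for stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy vocabulary and logits for a multiple-choice prompt such as
# "Question: ... Options: A, B, C, D. Answer: [MASK]".
toy_vocab = {"A": 0, "B": 1, "C": 2, "D": 3, "the": 4, "[MASK]": 5}
toy_mask_logits = [1.2, 3.4, 0.3, -0.5, 5.0, -2.0]

label_token_ids = {label: toy_vocab[label] for label in "ABCD"}
prediction, probabilities = classify_with_mask_logits(
    toy_mask_logits, label_token_ids
)
print(prediction, probabilities)
```

In this toy setup the non-label token "the" has the highest raw logit, but it is ignored because only the four option tokens are compared. No task-specific classification head is involved: the same vocabulary-sized output that serves masked-token prediction is read off at the [MASK] position.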