Pretrained Hybrids with MAD Skills

📅 2024-06-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
The growing number of alternative LM architectures makes designing heterogeneous hybrid models difficult: hybrid design has required manual expert-driven search, and new hybrids must be pretrained from scratch. Method: This paper introduces Manticore, a framework for automatically constructing pretrained hybrid LMs. It builds on differentiable neural architecture search (NAS) to discover cross-architecture combinations (e.g., GPT-Mamba), adds simple feature projectors that translate representations between pretrained blocks from different architecture families, and fine-tunes the resulting hybrids end-to-end. Contribution/Results: Manticore enables LM selection without training multiple models and lets users "program" hybrids toward particular capabilities; its hybrids outperform existing manually designed hybrids, achieve strong performance on Long Range Arena (LRA) tasks, and can improve on pretrained Transformers and state space models.
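
To make the mechanism concrete, here is a minimal PyTorch sketch of a two-block hybrid layer. It assumes each wrapped block maps a (batch, seq, dim) tensor to a tensor of the same shape; the module name `HybridBlock`, the dimensions `d_a`/`d_b`/`d_model`, and the softmax gating are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Convex mixture of two pretrained blocks from different
    architecture families (e.g., a GPT block and a Mamba block).
    Simplified sketch: projector shapes and gating are assumptions."""

    def __init__(self, block_a, block_b, d_a, d_b, d_model):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b
        # Projectors translate features between the shared model
        # width and each block's native hidden width.
        self.in_a, self.out_a = nn.Linear(d_model, d_a), nn.Linear(d_a, d_model)
        self.in_b, self.out_b = nn.Linear(d_model, d_b), nn.Linear(d_b, d_model)
        # One learnable logit per candidate block (DARTS-style).
        self.alpha = nn.Parameter(torch.zeros(2))

    def forward(self, x):
        # Convex combination of the two projected block outputs.
        w = torch.softmax(self.alpha, dim=0)
        y_a = self.out_a(self.block_a(self.in_a(x)))
        y_b = self.out_b(self.block_b(self.in_b(x)))
        return w[0] * y_a + w[1] * y_b
```

Because the mixture weights are differentiable, gradient descent on the fine-tuning loss can decide how much of each architecture to use at every level, which is what allows search without training many separate models.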

📝 Abstract
While Transformers underpin modern large language models (LMs), there is a growing list of alternative architectures with new capabilities, promises, and tradeoffs. This makes choosing the right LM architecture challenging. Recently-proposed $\textit{hybrid architectures}$ seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We propose $\textbf{Manticore}$, a framework that addresses these challenges. Manticore $\textit{automates the design of hybrid architectures}$ while reusing pretrained models to create $\textit{pretrained}$ hybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by incorporating simple projectors that translate features between pretrained blocks from different architectures. We then fine-tune hybrids that combine pretrained models from different architecture families -- such as the GPT series and Mamba -- end-to-end. With Manticore, we enable LM selection without training multiple models, the construction of pretrained hybrids from existing pretrained models, and the ability to $\textit{program}$ pretrained hybrids to have certain capabilities. Manticore hybrids outperform existing manually-designed hybrids, achieve strong performance on Long Range Arena (LRA) tasks, and can improve on pretrained transformers and state space models.
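
Reading the abstract's description as a differentiable NAS mixture, a plausible formalization (our notation, not the paper's) is that hybrid level $\ell$ combines $K$ candidate pretrained blocks $f_1, \dots, f_K$ via learned projectors and mixture weights:

$$y^{(\ell)} = \sum_{k=1}^{K} \alpha^{(\ell)}_k \, P^{\mathrm{out}}_k\big(f_k\big(P^{\mathrm{in}}_k(x^{(\ell)})\big)\big), \qquad \alpha^{(\ell)} = \mathrm{softmax}\big(a^{(\ell)}\big),$$

where $P^{\mathrm{in}}_k$ and $P^{\mathrm{out}}_k$ are the simple projectors that translate features into and out of block $k$'s native width, and the logits $a^{(\ell)}$ are fine-tuned end-to-end together with the projectors.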
Problem

Research questions and friction points this paper is trying to address.

Automating hybrid architecture design by reusing pretrained models
Enabling LM selection without training multiple models from scratch
Creating pretrained hybrids with programmable capabilities across architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates hybrid architecture design via differentiable NAS
Reuses pretrained models with cross-architecture feature projectors
Fine-tunes multi-architecture hybrids end-to-end for enhanced capabilities (see the fine-tuning sketch after this list)
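
Below is a hedged sketch of one end-to-end fine-tuning step, assuming a hybrid LM built from mixture blocks like the `HybridBlock` sketch above. The single-optimizer joint update of all parameter groups is our simplification; the paper's actual optimization schedule may differ.

```python
import torch
import torch.nn.functional as F

def finetune_step(hybrid_lm, optimizer, inputs, targets):
    """One joint update of pretrained-block weights, projectors, and
    mixture logits. Simplified: real schedules may freeze or alternate
    parameter groups rather than update everything at once."""
    hybrid_lm.train()
    logits = hybrid_lm(inputs)  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

"Programming" a hybrid, in the abstract's sense, could then amount to fixing the mixture logits by hand or on a proxy task rather than learning them, steering the hybrid toward the capabilities of particular blocks.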
Nicholas Roberts
PhD candidate, UW-Madison
Machine Learning · AutoML · Data-centric AI

Samuel Guo
University of Wisconsin-Madison

Zhiqi Gao
University of Wisconsin-Madison

Satya Sai Srinath Namburi
University of Wisconsin-Madison

Sonia Cromp
PhD Student, University of Wisconsin-Madison
Machine Learning · Artificial Intelligence

Chengjun Wu
University of Wisconsin-Madison

Chengyu Duan
University of Wisconsin-Madison

Frederic Sala
Assistant Professor, University of Wisconsin
Data-centric AI · Machine Learning · Information Theory