Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the high deployment cost of large language models (LLMs) and the substantial performance degradation seen in existing layer-tying approaches, this paper proposes Relaxed Recursive Transformers (RRTs). An RRT achieves strong parameter sharing by recursively reusing a single stack of Transformer layers, and relaxes the tying constraint through depth-wise low-rank adaptation (LoRA) modules. The paper also introduces Continuous Depth-wise Batching, a new inference paradigm that combines early exiting with continuous scheduling of tokens across loop depths; a theoretical analysis suggests up to 2–3× throughput gains. A recursive Gemma 1B, initialized from Gemma 2B, recovers roughly 90% of Gemma 2B's performance with only half the parameters and significantly outperforms TinyLlama 1.1B and Pythia 1B. The core contribution is unifying tunable-strength parameter sharing, recursive-structure initialization, and dynamic-depth inference in a single efficient small-model architecture, which the summary characterizes as the first such integration.

📝 Abstract
Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
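The abstract's core mechanism — a single shared layer block looped multiple times, with a depth-specific low-rank (LoRA) delta relaxing the tying at each repetition — can be sketched in a few lines. This is a toy illustration with numpy, not the paper's implementation: the `tanh` layer stands in for a full Transformer block, and all names and dimensions are invented for the example. Note that zero-initializing one LoRA factor makes the relaxed model start out exactly equal to the fully tied one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, n_loops = 8, 2, 3  # toy sizes; real models use d_model in the thousands

# One shared ("tied") weight matrix, reused at every loop depth.
W_shared = rng.standard_normal((d_model, d_model)) * 0.1

# Per-depth LoRA factors relax the tying constraint: the effective
# weight at depth d is W_shared + B[d] @ A[d], a low-rank perturbation.
A = [rng.standard_normal((rank, d_model)) * 0.1 for _ in range(n_loops)]
B = [np.zeros((d_model, rank)) for _ in range(n_loops)]  # zero-init => exact tying at start

def recursive_forward(x):
    # Loop the single shared block n_loops times, applying a
    # depth-specific low-rank delta at each repetition.
    for d in range(n_loops):
        W_eff = W_shared + B[d] @ A[d]
        x = np.tanh(x @ W_eff)  # stand-in for one Transformer layer
    return x

x = rng.standard_normal((1, d_model))
y = recursive_forward(x)
```

Because `B` starts at zero, training the LoRA factors lets each depth drift away from the shared weights only as far as the data demands, while the extra parameter count stays at `2 * rank * d_model` per depth instead of a full `d_model * d_model` matrix.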
Problem

Research questions and friction points this paper is trying to address.

Reduce LLM deployment costs
Enhance parameter sharing efficiency
Preserve performance in a compact model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise LoRA enhances model flexibility
Recursive Transformers minimize unique parameters
Continuous Depth-wise Batching boosts inference speed
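The throughput claim behind the last bullet can be illustrated with a back-of-envelope model. This is my own toy accounting, not the paper's theoretical analysis: because every loop iteration reuses the same weights, tokens at different depths can share one batched forward pass, and a token that exits early frees its slot immediately, so total compute scales with the sum of per-token exit depths rather than tokens times maximum depth.

```python
# Toy cost model for Continuous Depth-wise Batching with early exit.
# All numbers below are illustrative assumptions, not results from the paper.
def throughput_gain(exit_depths, max_depth):
    full_cost = len(exit_depths) * max_depth  # every token runs all loops
    early_cost = sum(exit_depths)             # tokens stop at their exit depth
    return full_cost / early_cost

# Suppose half the tokens exit after 1 of 3 loops and the rest run all 3.
depths = [1] * 50 + [3] * 50
gain = throughput_gain(depths, max_depth=3)  # 300 / 200 = 1.5x
```

Under this simple model, the 2–3× gains the summary cites would require more aggressive early exiting (e.g., most tokens leaving after the first loop); the real analysis also has to account for batching overhead and memory traffic.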