Liger: Linearizing Large Language Models to Gated Recurrent Structures

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory and computational overhead of deploying Transformer-based large language models (LLMs), this paper proposes a parameter-free gated linear recurrent modeling approach: it repurposes the pretrained key projection weights to construct diverse gating mechanisms, efficiently linearizing LLMs into gated recurrent structures. The method combines this reuse of pretrained weights with lightweight LoRA fine-tuning and the novel Liger Attention, an intra-layer hybrid attention mechanism, recovering 93% of the original model's performance while fine-tuning on only 0.02% of the pre-training tokens. On models ranging from 1B to 8B parameters, training complexity is reduced to linear time, inference memory to constant space, and benchmark performance remains competitive with the original Transformer. Key innovations include the zero-parameter gating design and deep reuse of pretrained weights without architectural expansion.

📝 Abstract
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning, and they overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, enabling various gated recurrent structures without training additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the Transformer-based LLM's performance using only 0.02% of the pre-training tokens during linearization, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
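The abstract's core idea is a gated linear recurrence whose gates come from the pretrained key projection, adding no new parameters. A minimal NumPy sketch of that idea follows; the specific gate formulation (a sigmoid of the key projection) and the recurrence form are illustrative assumptions, since the paper constructs several gating variants that may differ in detail:

```python
import numpy as np

def gated_linear_recurrence(x, W_q, W_k, W_v):
    """Sketch of a gated linear recurrence with a key-derived gate.

    Per step t:  g_t = sigmoid(x_t W_k)      (gate reuses W_k: no new weights)
                 S_t = diag(g_t) S_{t-1} + k_t^T v_t
                 o_t = q_t S_t
    The state S has fixed size (d, d), so inference memory is constant in T.
    """
    T, d = x.shape
    S = np.zeros((d, d))
    out = np.zeros((T, d))
    for t in range(T):
        q, k, v = x[t] @ W_q, x[t] @ W_k, x[t] @ W_v
        g = 1.0 / (1.0 + np.exp(-k))      # gate from the same key projection
        S = g[:, None] * S + np.outer(k, v)  # decay old state, add new rank-1 update
        out[t] = q @ S
    return out
```

Because the loop does constant work per token against a fixed-size state, training cost is linear in sequence length, in contrast to the quadratic attention of the original Transformer.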
Problem

Research questions and friction points this paper is trying to address.

Linearizing pretrained LLMs into gated recurrent structures
Avoiding extra parameters and extensive fine-tuning
Restoring performance with lightweight fine-tuning and hybrid attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linearizes LLMs to gated recurrent structures
Repurposes key matrix weights for gating
Uses LoRA for lightweight fine-tuning
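The LoRA step above adds small trainable low-rank matrices on top of frozen pretrained weights. A minimal sketch of the standard LoRA forward pass (the rank and scaling factor `alpha` are illustrative defaults, not values from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x W + (alpha / r) * x A B, with W frozen.

    A: (d_in, r), B: (r, d_out); only A and B are trained, so the
    number of new parameters is r * (d_in + d_out) per adapted matrix.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)
```

Initializing B to zeros makes the adapted layer start exactly at the pretrained behavior, which suits Liger's goal of restoring, rather than relearning, the original model's performance.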
Disen Lan
Ph.D. student, Fudan University
Large Language Models · Efficient Deep Learning
Weigao Sun
Research Scientist, Shanghai AI Laboratory
LLM · Deep Learning · Optimization
Jiaxi Hu
The Hong Kong University of Science and Technology (Guangzhou)
Jusen Du
Shanghai AI Laboratory, Nanjing University
Yu Cheng
The Chinese University of Hong Kong