Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

📅 2026-04-27

🤖 AI Summary
This work addresses the problem that most hybrid sequence models are pretrained from scratch and therefore cannot reuse existing Transformer checkpoints. The authors propose HyLo, an upcycling recipe that converts a pretrained Transformer into a hybrid architecture combining Multi-Head Latent Attention (MLA) with linear sequence-modeling blocks (Mamba2 or Gated DeltaNet) through architectural adaptation and staged long-context training. Teacher-guided distillation stabilizes optimization, and the recipe is designed to preserve short-context quality. HyLo extends usable context length by up to 32× and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding. It significantly outperforms state-of-the-art upcycled hybrid baselines on long-context benchmarks such as RULER, and HyLo-Qwen-1.7B, trained on 10B tokens, outperforms JetNemotron (trained on 400B tokens, 40× more data) on GSM8K, LM-Harness commonsense reasoning, and RULER-64K.

📝 Abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, LM-Harness commonsense reasoning, and RULER-64K.
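
To give a feel for why replacing most attention layers with linear blocks and caching an MLA latent can shrink the KV cache so sharply, here is a minimal back-of-the-envelope sketch. Every number in it (layer counts, head sizes, latent and RoPE dimensions) is a hypothetical illustration and is not taken from the paper, and the function names are invented for this example. It assumes a grouped-query attention baseline that caches keys and values per layer, an MLA layer that caches one compressed latent plus a decoupled RoPE key per token, and linear blocks whose state is constant in sequence length and is therefore ignored.

```python
def gqa_kv_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size for a standard (grouped-query) attention stack.

    Each layer stores one key and one value vector per KV head per token.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim
    return seq_len * n_layers * per_token_per_layer * bytes_per_elem


def mla_kv_bytes(seq_len, n_mla_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """Cache size when only n_mla_layers keep a per-token cache.

    Assumes each MLA layer caches a compressed latent (latent_dim) plus a
    decoupled RoPE key (rope_dim) per token. Linear blocks are assumed to
    hold a fixed-size state and are left out of this estimate.
    """
    per_token_per_layer = latent_dim + rope_dim
    return seq_len * n_mla_layers * per_token_per_layer * bytes_per_elem


def reduction(baseline_bytes, hybrid_bytes):
    """Fraction of cache memory saved relative to the baseline."""
    return 1.0 - hybrid_bytes / baseline_bytes


if __name__ == "__main__":
    seq_len = 64 * 1024  # 64K tokens

    # Hypothetical baseline: 16 attention layers, 8 KV heads of size 64, fp16.
    baseline = gqa_kv_bytes(seq_len, n_layers=16, n_kv_heads=8, head_dim=64)

    # Hypothetical hybrid: 4 MLA layers kept, the rest replaced by linear blocks.
    hybrid = mla_kv_bytes(seq_len, n_mla_layers=4, latent_dim=128, rope_dim=32)

    print(f"baseline: {baseline / 1024**3:.2f} GiB")
    print(f"hybrid:   {hybrid / 1024**2:.2f} MiB")
    print(f"saved:    {reduction(baseline, hybrid):.1%}")
```

Both estimates grow linearly with sequence length, so the saved fraction depends only on the per-token footprint of each design. The saving comes from two multiplicative factors: fewer layers that cache anything per token, and a smaller per-token entry in the layers that remain.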

Problem

Research questions and friction points this paper is trying to address.

long-context
upcycling
hybrid LLM
Transformer
sequence modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

HyLo
long-context upcycling
hybrid sequence modeling
Multi-Head Latent Attention
KV-cache compression