InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing trainable sparse attention methods for long-sequence Transformer models often introduce additional parameters, violating the “short pretraining + long fine-tuning” paradigm and leading to slow convergence and poor adaptability. To address this, we propose InfLLM-V2, a dense-sparse switchable attention mechanism that achieves seamless mode transition without introducing any new parameters, relying purely on reuse of the existing attention parameters. In short-sequence regimes it employs standard dense attention; for long sequences it activates a trainable sparse attention pattern. This design preserves training consistency while significantly improving inference efficiency. Experiments demonstrate that InfLLM-V2 retains 98.1% and 99.7% of baseline performance on long-context understanding and chain-of-thought reasoning tasks, respectively, while accelerating inference by 4×. The implementation and the MiniCPM-4.1 model are publicly released.

📝 Abstract
Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional “pretrain-on-short, finetune-on-long” workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention mechanism that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through a parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4× faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
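To make the switching idea concrete, the following is a minimal PyTorch sketch of a dense-sparse switchable dispatch. It is not the paper's implementation: the sequence-length threshold, block size, top-k value, and the mean-pooled block-scoring rule are all illustrative assumptions, and the sparse branch builds a full attention mask (so it shows the selection logic without the fused-kernel speedup the paper achieves). What it does share with the described design is that both branches consume the same query/key/value tensors, i.e. no new parameters are introduced.

```python
import torch
import torch.nn.functional as F

def switchable_attention(q, k, v, dense_threshold=64, block=16, topk=2):
    """Sketch of dense-sparse switchable attention (non-causal).

    q, k, v: (batch, heads, seq, dim). The same projections feed both
    branches, so switching adds no parameters. `dense_threshold`,
    `block`, and `topk` are illustrative values, not the paper's.
    Assumes seq is a multiple of `block` in the sparse branch.
    """
    seq = q.shape[-2]
    if seq <= dense_threshold:
        # Short sequences: standard dense attention.
        return F.scaled_dot_product_attention(q, k, v)

    # Long sequences: score key/value blocks by their mean-pooled keys
    # and let each query attend only to its top-k blocks. Illustrative
    # only -- a real kernel would skip the unselected blocks entirely.
    n_blocks = seq // block
    kb = k.reshape(*k.shape[:-2], n_blocks, block, k.shape[-1]).mean(-2)
    block_scores = q @ kb.transpose(-1, -2)          # (b, h, seq, n_blocks)
    keep = block_scores.topk(topk, dim=-1).indices
    mask = torch.zeros_like(block_scores, dtype=torch.bool)
    mask.scatter_(-1, keep, True)                    # True = attend to block
    key_mask = mask.repeat_interleave(block, dim=-1) # expand to per-key mask
    return F.scaled_dot_product_attention(q, k, v, attn_mask=key_mask)
```

Because the branch is chosen per forward pass from the input length alone, the same weights serve short pretraining (dense path) and long fine-tuning or inference (sparse path) without any mode-specific parameters.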
Problem

Research questions and friction points this paper is trying to address.

Overcoming computational bottlenecks in long-sequence processing
Enabling seamless adaptation from short to long sequences
Reducing memory usage while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense-sparse switchable attention for seamless adaptation
Reuses dense attention parameters, introducing no new ones
Efficient implementation reduces computational overhead significantly