InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing trainable sparse attention methods for long-sequence Transformer models often introduce additional parameters, violating the “short pretraining + long fine-tuning” paradigm and leading to slow convergence and poor adaptability. To address this, we propose InfLLM-V2, a dense-sparse switchable attention mechanism that achieves seamless mode transition without introducing any new parameters, relying purely on reuse of the existing attention parameters. In short-sequence regimes it employs standard dense attention; for long sequences it activates a trainable sparse attention pattern. This design preserves training consistency while significantly improving inference efficiency. Experiments demonstrate that InfLLM-V2 retains 98.1% and 99.7% of baseline performance on long-context understanding and chain-of-thought reasoning tasks, respectively, while accelerating inference by 4×. The implementation and the MiniCPM-4.1 model are publicly released.

📝 Abstract
Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional “pretrain-on-short, finetune-on-long” workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention mechanism that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through a parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4× faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
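To make the switching idea concrete, the following is a minimal PyTorch sketch of a dense-sparse switchable dispatch. It is not the paper's implementation: the sequence-length threshold, block size, top-k value, and the mean-pooled block-scoring rule are all illustrative assumptions, and the sparse branch builds a full attention mask (so it shows the selection logic without the fused-kernel speedup the paper achieves). What it does share with the described design is that both branches consume the same query/key/value tensors, i.e. no new parameters are introduced.

```python
import torch
import torch.nn.functional as F

def switchable_attention(q, k, v, dense_threshold=64, block=16, topk=2):
    """Sketch of dense-sparse switchable attention (non-causal).

    q, k, v: (batch, heads, seq, dim). The same projections feed both
    branches, so switching adds no parameters. `dense_threshold`,
    `block`, and `topk` are illustrative values, not the paper's.
    Assumes seq is a multiple of `block` in the sparse branch.
    """
    seq = q.shape[-2]
    if seq <= dense_threshold:
        # Short sequences: standard dense attention.
        return F.scaled_dot_product_attention(q, k, v)

    # Long sequences: score key/value blocks by their mean-pooled keys
    # and let each query attend only to its top-k blocks. Illustrative
    # only -- a real kernel would skip the unselected blocks entirely.
    n_blocks = seq // block
    kb = k.reshape(*k.shape[:-2], n_blocks, block, k.shape[-1]).mean(-2)
    block_scores = q @ kb.transpose(-1, -2)          # (b, h, seq, n_blocks)
    keep = block_scores.topk(topk, dim=-1).indices
    mask = torch.zeros_like(block_scores, dtype=torch.bool)
    mask.scatter_(-1, keep, True)                    # True = attend to block
    key_mask = mask.repeat_interleave(block, dim=-1) # expand to per-key mask
    return F.scaled_dot_product_attention(q, k, v, attn_mask=key_mask)
```

Because the branch is chosen per forward pass from the input length alone, the same weights serve short pretraining (dense path) and long fine-tuning or inference (sparse path) without any mode-specific parameters.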
Problem

Research questions and friction points this paper is trying to address.

Overcoming computational bottlenecks in long-sequence processing
Enabling seamless adaptation from short to long sequences
Reducing memory usage while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense-sparse switchable attention for seamless adaptation
Reuses dense attention parameters, introducing no new ones
Efficient implementation reduces computational overhead significantly