InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between degraded windowed attention performance and insufficient fine-grained modeling by linear attention in Vision-Language Models (VLMs) on long, information-dense sequences (e.g., OCR, video understanding), this paper proposes an efficient architecture supporting arbitrarily long vision-language sequences. Methodologically, it introduces: (1) a synergistic mechanism combining sliding-window attention with Gated DeltaNet to jointly capture local details and global long-range dependencies; (2) a three-stage lightweight training paradigm—knowledge distillation pretraining, instruction tuning, and long-sequence supervised fine-tuning—achieving competitive performance using less than 2% of the training data required by leading VLMs; and (3) an efficiency profile that, compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, yields over a 3.6× inference speedup with constant memory footprint and latency. Experiments demonstrate a stable 24 FPS real-time prefill speed in streaming video understanding, effective long-term memory retention, and overall performance competitive with leading Transformer-based VLMs.

📝 Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
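The two mechanisms the abstract pairs can be illustrated with a minimal single-head sketch. This is not InfiniteVL's implementation: the gating parameterization (`alpha`, `beta`), the head structure, and how the paper actually interleaves or fuses the two mechanisms across layers are all assumptions made for illustration. It only shows why the pair is complementary: SWA gives exact softmax attention over a fixed local window (constant cost per token), while the gated delta rule compresses the full history into a fixed-size state.

```python
import numpy as np

def sliding_window_attention(q, k, v, window=4):
    """Causal softmax attention restricted to the last `window` positions."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        s = max(0, t - window + 1)                     # window start
        scores = q[t] @ k[s:t + 1].T / np.sqrt(d)      # scaled dot products
        w = np.exp(scores - scores.max())
        w /= w.sum()                                   # softmax over the window
        out[t] = w @ v[s:t + 1]
    return out

def gated_delta_rule(q, k, v, alpha, beta):
    """Simplified gated delta rule (Gated DeltaNet style recurrence):
       S_t = alpha_t * (I - beta_t k_t k_t^T) S_{t-1} + beta_t k_t v_t^T,
       o_t = S_t^T q_t. State S is a fixed (d_k, d_v) matrix, so memory
       and per-token cost stay constant regardless of sequence length."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros_like(v)
    for t in range(T):
        kt = k[t:t + 1].T                              # (d, 1) key column
        # decay old state, erase along k_t, then write the new association
        S = alpha[t] * (S - beta[t] * kt @ (kt.T @ S)) + beta[t] * kt @ v[t:t + 1]
        out[t] = (q[t:t + 1] @ S)[0]                   # read-out: S^T q_t
    return out

def hybrid_mixer(q, k, v, alpha, beta, window=4):
    """Hypothetical fusion for illustration: sum local (SWA) and global
       (delta-rule) outputs. InfiniteVL's actual layer arrangement differs."""
    return sliding_window_attention(q, k, v, window) + gated_delta_rule(q, k, v, alpha, beta)
```

The key property the sketch makes visible is that neither branch's cost grows with context length: SWA touches at most `window` keys per token, and the delta-rule state `S` never grows, which is what enables the constant-latency, constant-memory streaming behavior the abstract reports.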
Problem

Research questions and friction points this paper is trying to address.

Overcomes performance degradation in window attention for long sequences.
Addresses linear attention underperformance on information-intensive tasks like OCR.
Mitigates quadratic complexity and KV cache growth in Vision-Language Models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergizes sliding window attention with Gated DeltaNet
Employs three-stage training strategy for efficiency
Achieves linear complexity with constant memory footprint
Authors
Hongyuan Tao
Huazhong University of Science and Technology
Bencheng Liao
Huazhong University of Science and Technology
Shaoyu Chen
Horizon Robotics
Haoran Yin
Leiden University
Qian Zhang
Horizon Robotics
Wenyu Liu
Huazhong University of Science and Technology
Xinggang Wang
Professor, Huazhong University of Science and Technology
Research interests: Artificial Intelligence, Computer Vision, Autonomous Driving, Object Detection, Object Segmentation