🤖 AI Summary
To address the challenges of context extension, performance degradation with more images, and high computational overhead in multimodal large language models (MLLMs) for long videos and high-resolution images, this work proposes the first Mamba–Transformer hybrid MLLM architecture, leveraging Mamba's linear-complexity state-space modeling and the Transformer's strong representational capacity. The authors introduce a multi-image data construction method that captures temporal and spatial dependencies, together with a progressive training strategy that optimizes vision–language alignment over extended contexts. The model supports single-pass inference on nearly 1,000 images, achieving competitive performance across multiple video understanding and multi-image reasoning benchmarks. Notably, it runs this near-thousand-image inference on a single A100 80GB GPU with low memory consumption and high throughput, demonstrating strong practical deployability.
📝 Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as *degraded performance with more images* and *high computational costs*. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model **LongLLaVA** (**Long**-Context **L**arge **L**anguage **a**nd **V**ision **A**ssistant) is the first hybrid MLLM, and it achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
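To make the hybrid idea concrete, here is a minimal, self-contained sketch of interleaving linear-complexity state-space (Mamba-style) layers with quadratic softmax-attention (Transformer-style) layers. This is an illustration only: the block internals, the 3:1 layer ratio, and all function names are hypothetical simplifications, not LongLLaVA's actual implementation.

```python
import numpy as np

def ssm_block(x, A=0.9, B=0.5, C=1.0):
    """Toy diagonal state-space recurrence, O(T) in sequence length:
    h_t = A*h_{t-1} + B*x_t ;  y_t = C*h_t  (applied per channel)."""
    T, d = x.shape
    h = np.zeros(d)
    y = np.empty_like(x)
    for t in range(T):
        h = A * h + B * x[t]
        y[t] = C * h
    return y

def attention_block(x):
    """Toy single-head self-attention, O(T^2): strong global token mixing."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def hybrid_forward(x, n_layers=8):
    """Residual stack that is mostly linear-time SSM layers, with an
    attention layer every 4th layer (ratio chosen for illustration)."""
    for i in range(n_layers):
        mix = attention_block(x) if i % 4 == 3 else ssm_block(x)
        x = x + mix
    return x

tokens = np.random.randn(16, 8)  # 16 "frame tokens" of dimension 8
out = hybrid_forward(tokens)
print(out.shape)  # (16, 8)
```

The intuition this sketch captures is the efficiency/effectiveness trade-off the abstract describes: most layers scale linearly with the number of image tokens, while the occasional attention layer retains global cross-frame mixing.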