LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 61
Influential: 1
📄 PDF
🤖 AI Summary
To address context extension, multi-image performance degradation, and high computational overhead in multimodal large language models (MLLMs) for long-video and high-resolution image analysis, this work proposes the first Mamba–Transformer hybrid MLLM, combining Mamba's linear-complexity state-space modeling with the Transformer's strong representational capacity. The authors introduce a spatiotemporally aware multi-image sequence construction method and a progressive training strategy to optimize vision–language alignment over extended contexts. The resulting model supports single-pass inference on up to ~1,000 frames or images, achieving state-of-the-art or near-state-of-the-art results across multiple video-understanding and multi-image reasoning benchmarks. Notably, it performs thousand-image inference on a single A100 80GB GPU with a low memory footprint and high throughput, demonstrating strong practical deployability.
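The hybrid architecture interleaves linear-complexity Mamba (state-space) layers with a smaller number of Transformer (attention) layers. The summary does not state the exact layer layout, so the sketch below is a minimal, hypothetical illustration of one common way to express such an interleaving: a layer schedule that places one attention layer every N layers, with state-space layers elsewhere. The function name and the 1-in-8 ratio are assumptions for illustration, not taken from the paper.

```python
def build_layer_schedule(num_layers: int, transformer_every: int = 8):
    """Sketch of a hybrid Mamba-Transformer layer layout.

    Places one Transformer (attention) layer every `transformer_every`
    layers and Mamba (state-space) layers elsewhere. The 1-in-8 ratio
    is an illustrative assumption, not the paper's actual configuration.
    """
    schedule = []
    for i in range(num_layers):
        if (i + 1) % transformer_every == 0:
            schedule.append("transformer")  # quadratic-cost attention layer
        else:
            schedule.append("mamba")  # linear-cost state-space layer
    return schedule


# Example: a 16-layer stack with attention at layers 8 and 16.
print(build_layer_schedule(16))
```

Keeping attention layers sparse is what lets total compute and KV-cache memory grow roughly linearly with context length, which is the property that makes near-thousand-image inference feasible on a single GPU.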

📝 Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as *degraded performance with more images* and *high computational costs*. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model **LongLLaVA** (**Long**-Context **L**arge **L**anguage **a**nd **V**ision **A**ssistant) is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
Problem

Research questions and friction points this paper is trying to address.

Scaling multi-modal LLMs to handle 1000+ images efficiently
Addressing performance degradation with increasing image counts
Reducing high computational costs in long-context MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer architecture for efficiency
Data construction capturing temporal-spatial dependencies
Progressive training strategy for scaling capability