LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 61
Influential: 1
📄 PDF
🤖 AI Summary
To address context extension, multi-image performance degradation, and high computational overhead in multimodal large language models (MLLMs) for long-video and high-resolution image analysis, this work proposes the first Mamba–Transformer hybrid MLLM, combining Mamba's linear-complexity state-space modeling with the Transformer's strong representational capacity. The authors introduce a spatiotemporally aware multi-image sequence construction method and a progressive training strategy to optimize vision–language alignment over extended contexts. The resulting model supports single-pass inference on up to ~1,000 frames or images, achieving state-of-the-art or near-state-of-the-art results across multiple video-understanding and multi-image reasoning benchmarks. Notably, it performs thousand-image inference on a single A100 80GB GPU with a low memory footprint and high throughput, demonstrating strong practical deployability.
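The hybrid architecture interleaves linear-complexity Mamba (state-space) layers with a smaller number of Transformer (attention) layers. The summary does not state the exact layer layout, so the sketch below is a minimal, hypothetical illustration of one common way to express such an interleaving: a layer schedule that places one attention layer every N layers, with state-space layers elsewhere. The function name and the 1-in-8 ratio are assumptions for illustration, not taken from the paper.

```python
def build_layer_schedule(num_layers: int, transformer_every: int = 8):
    """Sketch of a hybrid Mamba-Transformer layer layout.

    Places one Transformer (attention) layer every `transformer_every`
    layers and Mamba (state-space) layers elsewhere. The 1-in-8 ratio
    is an illustrative assumption, not the paper's actual configuration.
    """
    schedule = []
    for i in range(num_layers):
        if (i + 1) % transformer_every == 0:
            schedule.append("transformer")  # quadratic-cost attention layer
        else:
            schedule.append("mamba")  # linear-cost state-space layer
    return schedule


# Example: a 16-layer stack with attention at layers 8 and 16.
print(build_layer_schedule(16))
```

Keeping attention layers sparse is what lets total compute and KV-cache memory grow roughly linearly with context length, which is the property that makes near-thousand-image inference feasible on a single GPU.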

📝 Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as *degraded performance with more images* and *high computational costs*. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model **LongLLaVA** (**Long**-Context **L**arge **L**anguage **a**nd **V**ision **A**ssistant) is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
Problem

Research questions and friction points this paper is trying to address.

Scaling multi-modal LLMs to handle 1000+ images efficiently
Addressing performance degradation with increasing image counts
Reducing high computational costs in long-context MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer architecture for efficiency
Data construction capturing temporal-spatial dependencies
Progressive training strategy for scaling capability