Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited long-context processing capability of large vision-language models (VLMs), this paper introduces a novel architecture and training paradigm supporting ultra-long vision-language sequences—up to 1 million tokens or 4K video frames. Methodologically, it proposes: (i) a multi-stage progressive multimodal training strategy; (ii) a context-parallel inference mechanism coupled with a logits-masked language modeling head for efficient arbitrary-length image-text processing; and (iii) LLM-initialized multimodal alignment, joint vision-language training, and dual-platform (NPU/GPU) optimization. Experiments demonstrate that, trained solely on 17 million publicly available samples, the model surpasses recent state-of-the-art VLMs trained with internal data across major multimodal benchmarks, while remaining fully open-source and reproducible.
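The logits-masked language modeling head mentioned above can be illustrated with a minimal sketch. The idea, as described here, is to project hidden states to vocabulary logits only at the positions that are actually needed (e.g., the final position during generation), so a 1M-token sequence never materializes a full `[seq_len, vocab]` logits matrix. The function name and shapes below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def masked_lm_head(hidden, weight, keep_mask):
    """Project only the selected positions to vocabulary logits.

    hidden:    [seq_len, d_model] final hidden states
    weight:    [vocab, d_model]   LM head weight matrix
    keep_mask: [seq_len] bool     positions whose logits are needed
    """
    kept = hidden[keep_mask]   # [n_kept, d_model] -- drop masked positions first
    return kept @ weight.T     # [n_kept, vocab]   -- logits only where needed

# During autoregressive decoding only the last position matters,
# so memory for logits drops from O(seq_len * vocab) to O(vocab).
hidden = np.random.randn(8, 4)
weight = np.random.randn(10, 4)
mask = np.zeros(8, dtype=bool)
mask[-1] = True
logits = masked_lm_head(hidden, weight, mask)  # shape (1, 10)
```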

📝 Abstract
Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multi-modal agents and reasoning. We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of $17$M samples from public datasets only and demonstrates the state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
Problem

Research questions and friction points this paper is trying to address.

Scaling multi-modal models to handle 1 million tokens
Enabling long-context visual-language understanding for videos
Maintaining short-context accuracy while scaling to arbitrarily long inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales to 1M tokens with multi-modal processing
Uses multi-stage training with long-sequence fine-tuning
Implements context-parallel inference to scale to arbitrarily long inputs