🤖 AI Summary
Addressing the challenges of modeling ultra-long contexts (up to millions of tokens) and the lack of efficient architectures for joint video-language understanding, this paper introduces Blockwise RingAttention, a memory-efficient attention mechanism that enables scalable training and inference over extremely long sequences. The method combines progressive blockwise context expansion (from 4K to 1M tokens), large-scale cleaning and synthesis of long-sequence data, and video-text alignment pretraining. Leveraging these techniques, the authors present the first open-source multimodal foundation model supporting million-token contexts, at 7B parameters. The model achieves state-of-the-art performance on long-context retrieval tasks and processes over one million tokens in a single forward pass, roughly 1.5 hours of 4K video plus associated text. The authors publicly release an extensible training framework and model weights across multiple scales, establishing a new paradigm for long-horizon multimodal sequence modeling.
📝 Abstract
Enabling long-context understanding remains a key challenge in scaling existing sequence models -- a crucial component in developing generally intelligent models that can process and operate over long temporal horizons potentially consisting of millions of tokens. In this paper, we address this challenge by providing a comprehensive exploration of the full development process for producing 1M-token-context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long-context data curation process and our progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
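The core idea behind blockwise attention, which RingAttention builds on, is that the softmax over a long sequence can be computed incrementally over key/value blocks using a running max and normalizer, so memory stays constant in the block size rather than growing with sequence length. The sketch below is an illustrative single-device NumPy implementation of this online-softmax trick (it omits the ring-style communication of KV blocks across devices and is not the paper's JAX code; function names are our own):

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: materializes the full (n_q, n_kv) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def blockwise_attention(q, k, v, block=4):
    """Same result, but K/V are consumed one block at a time.

    Maintains a running row-max `m`, running normalizer `l`, and an
    unnormalized output accumulator `o` (the online-softmax recurrence).
    """
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of scores per query
    l = np.zeros((q.shape[0], 1))           # running softmax denominator
    o = np.zeros_like(q, dtype=float)       # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)           # rescale previous partial sums
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ vb
        m = m_new
    return o / l
```

Because each iteration only touches one KV block, peak memory is O(n_q * block) instead of O(n_q * n_kv); in RingAttention the same recurrence runs per device while KV blocks rotate around a ring of hosts, which is what makes million-token contexts feasible.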