VITA: Towards Open-Source Interactive Omni Multimodal LLM

📅 2024-08-09
🏛️ arXiv.org
📈 Citations: 84
Influential: 16
🤖 AI Summary
Existing open-source multimodal large language models (MLLMs) struggle to simultaneously support quad-modal (video, image, text, audio) understanding and natural, low-latency multi-turn interaction. This paper introduces VITA, the first open-source, end-to-end fully multimodal-fused MLLM, built upon the Mixtral 8×7B architecture. It extends the tokenizer to support bilingual (Chinese–English) text, integrates a unified vision–audio encoder, and employs a two-stage multi-task training strategy: Stage I uses CLIP-style cross-modal contrastive learning for semantic alignment; Stage II performs multimodal instruction tuning to enhance interactive capabilities. VITA is the first open-source MLLM to enable real-time speech-driven text-and-image generation and video question answering. Experiments demonstrate substantial improvements over state-of-the-art open-source MLLMs on multilingual (MMBench), visual (SEED-Bench), audio (AudioCaps), and cross-modal benchmarks.
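The Stage I alignment objective described above can be illustrated with a symmetric CLIP-style contrastive loss. The following is a minimal NumPy sketch under that assumption; the function name, batch layout, and temperature value are illustrative, not VITA's actual implementation:

```python
import numpy as np

def clip_contrastive_loss(vis, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vis, txt: (batch, dim) arrays of visual/audio and text features.
    Matching pairs share a row index; every other row in the batch
    serves as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    logits = vis @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(vis))         # positives lie on the diagonal

    def cross_entropy(lg, lb):
        # numerically stable log-softmax over each row
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # average the vision-to-text and text-to-vision directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Training drives the diagonal (matched pairs) to dominate each row and column, pulling the modality embeddings into a shared semantic space before instruction tuning begins.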

📝 Abstract
The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
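Expanding a base model's vocabulary, as the abstract describes for Chinese tokens, amounts to appending rows to the token-embedding matrix before the bilingual tuning stage. A minimal sketch follows, assuming mean-initialization of the new rows (a common heuristic, not necessarily VITA's recipe):

```python
import numpy as np

def expand_vocab(embeddings, num_new):
    """Append rows for newly added tokens to an embedding matrix.

    embeddings: (vocab_size, dim) array of existing token embeddings.
    num_new: number of tokens added to the tokenizer.
    New rows are initialized to the mean of the existing embeddings,
    which keeps them in-distribution until tuning adjusts them.
    """
    mean = embeddings.mean(axis=0, keepdims=True)   # (1, dim)
    new_rows = np.repeat(mean, num_new, axis=0)     # (num_new, dim)
    return np.concatenate([embeddings, new_rows], axis=0)
```

In a framework like Hugging Face Transformers, the same step is typically done via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.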
Problem

Research questions and friction points this paper is trying to address.

Open-source models lack strong multimodal and interactive capabilities
Integrating Video, Image, Text, and Audio processing in one model
Enhancing natural multimodal human-computer interaction experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands Mixtral 8x7B with bilingual instruction tuning
Uses two-stage multi-task learning for multimodal alignment
Enhances multimodal human-computer interaction experience