HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

📅 2024-12-20
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing monolithic vision-language models (VLMs) lack a unified embedding module for both modalities, making it difficult to gain strong multimodal capabilities without degrading the underlying large language model's (LLM's) native linguistic performance. To address this, the authors propose HoVLE, a high-performance monolithic VLM that preserves the original LLM architecture while enabling native multimodal understanding. Its core contribution is a holistic embedding module that maps visual and textual inputs into a shared semantic space, so the LLM can process images the same way it processes text. HoVLE is trained in three stages: (1) distillation of visual features from a pre-trained vision encoder and of text embeddings from the LLM, (2) multimodal next-token prediction to align the embeddings with the LLM, and (3) instruction tuning. Across various benchmarks, HoVLE approaches the performance of leading compositional VLMs and outperforms prior monolithic models by a large margin. The model is publicly released on Hugging Face.

πŸ“ Abstract
The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. Model available at https://huggingface.co/OpenGVLab/HoVLE.
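The distillation stage in the abstract can be sketched in a toy form: a single shared projection stands in for the holistic embedding module, and it is trained to match teacher targets from a frozen vision encoder (for image patches) and from the LLM's own text embeddings (for text tokens). All dimensions, weights, and feature vectors below are hypothetical stand-ins, not the paper's actual architecture:

```python
import random

random.seed(0)
DIM_IN, DIM_OUT = 4, 3  # toy sizes; HoVLE uses transformer layers, not a linear map

# Hypothetical shared projection standing in for the holistic embedding module.
W = [[random.uniform(-0.1, 0.1) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]

def embed(token):
    """Project a raw feature from either modality into the shared space."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, token)) for row in W]

def mse(a, b):
    """Mean squared error between a student embedding and a teacher target."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Stand-in inputs and teacher targets (all values are illustrative only).
image_patch = [0.5, -0.2, 0.1, 0.9]   # raw image patch
teacher_vision = [0.3, 0.1, -0.4]     # frozen vision-encoder feature
text_token = [1.0, 0.0, 0.0, 0.0]     # one-hot text token
teacher_text = [0.2, -0.1, 0.05]      # LLM text embedding

# Stage-1 distillation loss: both modalities are pulled toward their teachers,
# which requires no paired image-text data, only random images and text tokens.
loss = mse(embed(image_patch), teacher_vision) + mse(embed(text_token), teacher_text)
```

Minimizing this loss over unpaired data is what lets the module train at scale before the next-token prediction and instruction-tuning stages align it with the full model.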
Problem

Research questions and friction points this paper is trying to address.

Improving monolithic Vision-Language Models
Holistic embedding for vision and language
Multi-stage training strategy for alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Holistic embedding module
Multi-stage training strategy
Shared space conversion
Authors
Chenxin Tao, PhD student, Tsinghua University (computer vision)
Shiqian Su, PhD student, Tsinghua University (large language models, embodied intelligence, multimodal models)
Xizhou Zhu, Tsinghua University
Chenyu Zhang, Tsinghua University, Shanghai Artificial Intelligence Laboratory
Zhe Chen, Nanjing University, Shanghai Artificial Intelligence Laboratory
Jiawen Liu, Research Scientist, Meta (high performance computing, computer architecture, machine learning systems)
Wenhai Wang, The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory
Lewei Lu, Research Director, SenseTime Research (computer vision, deep learning)
Gao Huang, Tsinghua University
Yu Qiao, Shanghai Artificial Intelligence Laboratory
Jifeng Dai, Associate Professor of EE, Tsinghua University (computer vision, deep learning)