🤖 AI Summary
To address the lack of native multimodal capabilities in text-only large language models (LLMs) such as Llama-3, this paper proposes a lightweight vision-augmented framework built on modality-specific and shared attention components. The method freezes the original LLM parameters and trains only newly introduced image-processing modules. It uses modality-isolated feed-forward networks (FFNs), query-key-value (QKV) projections, and normalization layers, while sharing the self-attention computation so that text and image features can interact. On the image-generation side, a diffusion objective is used for high-fidelity synthesis. The approach preserves the original language capabilities while substantially improving multimodal comprehension and generation: relative to pretraining a multimodal model from scratch, image understanding improves by 20% and image generation by 3.6%, using only 50% of the training FLOPs. The framework is also transferable, and can equip existing vision-language models with generation abilities.
📝 Abstract
We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LMFusion leverages the existing weights of Llama-3 for processing text autoregressively, while introducing additional, parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LMFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs, while maintaining Llama-3's language capabilities. We also demonstrate that this framework can equip existing vision-language models with multimodal generation abilities. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
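The routing described above — per-modality QKV projections and feed-forward weights, with one shared self-attention over the mixed token sequence — can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration (class and variable names are ours, not the paper's; normalization, multi-head attention, and the diffusion objective are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ModalityRoutedBlock:
    """Sketch of one LMFusion-style transformer block: each token is
    projected with its own modality's QKV and FFN weights, but a single
    shared self-attention lets text and image tokens attend to each other.
    In training, index-0 (text) weights would be frozen; only index-1
    (image) weights would be updated."""

    TEXT, IMAGE = 0, 1

    def __init__(self, d, rng):
        s = 1.0 / np.sqrt(d)
        # One weight set per modality: [0] = text (frozen), [1] = image (trained)
        self.Wqkv = [rng.standard_normal((d, 3 * d)) * s for _ in (0, 1)]
        self.W1 = [rng.standard_normal((d, 4 * d)) * s for _ in (0, 1)]
        self.W2 = [rng.standard_normal((4 * d, d)) * s for _ in (0, 1)]

    def __call__(self, x, modality):
        # x: (seq, d) token states; modality: (seq,) array of 0 (text) / 1 (image)
        seq, d = x.shape
        qkv = np.empty((seq, 3 * d))
        for m in (self.TEXT, self.IMAGE):
            sel = modality == m               # route tokens to their QKV weights
            qkv[sel] = x[sel] @ self.Wqkv[m]
        q, k, v = np.split(qkv, 3, axis=-1)
        # Shared self-attention: cross-modal interaction happens here
        x = x + softmax(q @ k.T / np.sqrt(d)) @ v
        out = np.empty_like(x)
        for m in (self.TEXT, self.IMAGE):
            sel = modality == m               # modality-specific feed-forward
            h = np.maximum(x[sel] @ self.W1[m], 0.0)
            out[sel] = x[sel] + h @ self.W2[m]
        return out

rng = np.random.default_rng(0)
block = ModalityRoutedBlock(d=16, rng=rng)
tokens = rng.standard_normal((6, 16))
modality = np.array([0, 0, 1, 1, 1, 0])       # interleaved text/image sequence
y = block(tokens, modality)
print(y.shape)  # (6, 16)
```

The key design choice the sketch mirrors is that parameter isolation (separate FFN/QKV weights) protects the frozen language pathway, while the shared attention matrix is the only place the two modalities exchange information.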