IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

📅 2024-08-23

🏛️ Proceedings of the AAAI Conference on Artificial Intelligence

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Fine-tuning a large language model (LLM) on vision-language data tends to degrade its NLP performance, while earlier attempts to keep the LLM frozen have produced weak multimodal capabilities. To address both problems, this paper proposes the Inner-Adaptor Architecture (IAA). IAA inserts multiple learnable multimodal adaptors at varying depths within a frozen Transformer-based LLM, so that visual features interact directly with the model's text-oriented layers. Unlike earlier frozen-LLM approaches that require large-scale aligned data, IAA achieves superior performance on small-scale datasets. Experiments show that IAA outperforms previous state-of-the-art methods across multiple vision-language benchmarks, covering both general multimodal capabilities and visual grounding, without sacrificing the original LLM's performance on NLP tasks.

📝 Abstract

In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models will be released.
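The core idea in the abstract, trainable adaptors placed at several depths of a frozen layer stack, can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`FrozenLayer`, `Adaptor`, `ToyIAA`), the scalar "layers", and the choice of depths are all hypothetical. The sketch also assumes that text-only inputs bypass the adaptors, which is one plausible way to keep NLP behaviour unchanged; the abstract itself does not spell out the routing.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]


@dataclass(frozen=True)
class FrozenLayer:
    """Stand-in for a pretrained transformer layer; its weight never changes."""
    weight: float

    def __call__(self, h: Vector) -> Vector:
        return [self.weight * x + 1.0 for x in h]


@dataclass
class Adaptor:
    """Hypothetical trainable block inserted at one depth of the frozen stack."""
    scale: float = 0.0  # trainable; 0.0 makes the adaptor an identity at init

    def __call__(self, h: Vector) -> Vector:
        return [x + self.scale * x for x in h]


@dataclass
class ToyIAA:
    frozen_layers: List[FrozenLayer]
    adaptors: Dict[int, Adaptor] = field(default_factory=dict)  # depth -> adaptor

    def forward(self, h: Vector, multimodal: bool) -> Vector:
        for depth, layer in enumerate(self.frozen_layers):
            h = layer(h)
            # Assumption: only multimodal inputs are routed through adaptors.
            if multimodal and depth in self.adaptors:
                h = self.adaptors[depth](h)
        return h

    def trainable_parameters(self) -> List[Adaptor]:
        # Only the adaptors are updated; the frozen layers are immutable.
        return list(self.adaptors.values())


# Four frozen layers, with adaptors after depths 1 and 3 (arbitrary choice).
model = ToyIAA([FrozenLayer(0.5) for _ in range(4)], {1: Adaptor(), 3: Adaptor()})
tokens = [1.0, 2.0]

text_before = model.forward(tokens, multimodal=False)
for adaptor in model.trainable_parameters():
    adaptor.scale = 0.25  # pretend a training step updated the adaptors
text_after = model.forward(tokens, multimodal=False)
multimodal_out = model.forward(tokens, multimodal=True)
```

Under these assumptions, `text_before` and `text_after` are identical because the text-only path touches only frozen layers, while `multimodal_out` changes as the adaptors are trained.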

Problem

Research questions and friction points this paper is trying to address.

How to enable a frozen LLM to handle multimodal tasks effectively

How to prevent the loss of NLP capability caused by vision-language fine-tuning

How to reach high performance without large-scale aligned data

Innovation

Methods, ideas, or system contributions that make the work stand out.

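Inner-Adaptor Architecture (IAA) gives a frozen LLM multimodal capabilities

Multimodal adaptors inserted at varying depths interact directly with the text-oriented transformer layers

Achieves superior performance on small-scale datasets, where previous frozen-LLM approaches required large-scale aligned data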

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs