AI Summary
This work addresses the challenge of parameter-efficient cross-modal fusion in Vision Large Language Models (VLLMs). We systematically analyze 34 state-of-the-art VLLMs and unify their training paradigms into three categories (Single-stage Tuning, Two-stage Tuning, and Direct Adaptation), establishing the first taxonomy of VLLM efficiency grounded in training methodology. Our study fills a critical gap by providing the first systematic analysis of Direct Adaptation, empirically demonstrating that it achieves over 90% of Two-stage Tuning performance with less than 1% parameter overhead. We comprehensively examine core components, including LLM backbones, vision encoders, multimodal fusion architectures, parameter-efficient adaptation techniques (e.g., LoRA, Adapters), and evaluation protocols, and synthesize key benchmarks and metrics. The work delivers both a theoretical framework and empirical evidence to advance efficient multimodal modeling.
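To make the parameter-overhead arithmetic concrete, the following PyTorch sketch shows one of the techniques named above, LoRA, applied to a single linear layer. It is a minimal illustration, not code from any surveyed model; the layer size, rank, and scaling values are assumptions. The pretrained weight is frozen, and only a low-rank update consisting of two small factor matrices is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    h = W x + (alpha / r) * B A x, with A in R^{r x d_in} and B in R^{d_out x r}."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)

# Overhead arithmetic for a hypothetical 4096x4096 projection at rank 8:
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
# -> 65,536 trainable out of ~16.8M, i.e. about 0.4% of this layer's parameters.
```

This kind of accounting underlies sub-1% overhead figures: the trainable factors scale with r(d_in + d_out) rather than with the full d_in * d_out of the frozen weight.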
Abstract
The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. With the advent of Large Language Models (LLMs), however, there has been a notable shift towards integrating vision modalities into LLMs, and the training paradigms for doing so have evolved. The initial approach, termed Single-stage Tuning, integrates the modalities by pretraining a modality integrator. It has since branched into methods focusing on performance enhancement, denoted Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) trained with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their distinct parameter-efficiency considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited arXiv papers, focusing on parameter efficiency during adaptation from the training-paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion of vision encoders and a comprehensive taxonomy of modality integrators. We then review the three training paradigms and their efficiency considerations, and summarize benchmarks in the VLLM field. To gain deeper insight into their effectiveness in parameter efficiency, we compare and discuss the experimental results of representative models, and we replicate the experiments of the Direct Adaptation paradigm. By providing insights into recent developments and practical applications, this survey serves as a vital guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.
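To ground the three paradigms, the sketch below shows what the Direct Adaptation pattern can look like in code: both pretrained backbones stay frozen and only a small modality integrator is trained. This is our own minimal PyTorch illustration under assumed shapes and toy stand-in modules, not an implementation from any reviewed model.

```python
import torch
import torch.nn as nn

class DirectAdaptationVLLM(nn.Module):
    """Direct Adaptation sketch: frozen vision encoder + frozen LLM;
    only the modality integrator (here, a linear projection) is trainable."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)                     # freeze vision backbone
        for p in self.llm.parameters():
            p.requires_grad_(False)                     # freeze language backbone
        self.integrator = nn.Linear(vision_dim, llm_dim)  # the only trainable part

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vision_encoder(images)       # (B, N_patches, vision_dim)
        vision_tokens = self.integrator(patches)        # (B, N_patches, llm_dim)
        # Prepend projected vision tokens to the text embedding sequence.
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))

# Toy stand-ins so the sketch runs end to end; in practice these would be,
# e.g., a pretrained ViT and a decoder-only LLM.
model = DirectAdaptationVLLM(
    vision_encoder=nn.Identity(),   # pretend patch features are precomputed
    llm=nn.Linear(64, 100),         # stands in for the frozen language model
    vision_dim=32, llm_dim=64,
)
images = torch.randn(2, 49, 32)     # (batch, patches, vision_dim)
text_embeds = torch.randn(2, 10, 64)
logits = model(images, text_embeds)                     # (2, 59, 100)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")           # 2,112 -- the integrator only
```

The frozen-backbone pattern above is what keeps the trainable-parameter share so small under Direct Adaptation; Single-stage and Two-stage Tuning differ in whether, and in how many stages, the integrator (and possibly more) is pretrained and fine-tuned.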