Aligning Multimodal LLM with Human Preference: A Survey

📅 2025-03-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Multimodal large language models (MLLMs) still face persistent challenges in factual consistency, safety, reasoning capability, and alignment with human preferences. This paper presents the first systematic survey of human preference alignment for MLLMs, covering diverse modalities (images, videos, and audio) and synthesizing over 100 recent studies under a unified analytical framework with four dimensions: application scenarios, data construction methodologies, technical alignment paradigms, and evaluation protocols. Methodologically, it brings supervised fine-tuning, reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), contrastive learning, and multimodal preference annotation techniques into a single taxonomy. As a key contribution, the authors release the first open-source, continuously updated repository of MLLM alignment resources. The work outlines actionable research directions, including scalability, trustworthiness, and embodiment, and serves as a reference for the community.
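For context, DPO, one of the alignment paradigms named above, optimizes a policy directly on preference pairs without training a separate reward model. A standard form of the objective is shown below; the notation follows the original DPO formulation and is not necessarily the survey's own.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here, in the multimodal setting, \(x\) is the combined input (for example, an image plus an instruction), \(y_w\) and \(y_l\) are the preferred and rejected responses, \(\pi_{\mathrm{ref}}\) is a frozen reference model, \(\sigma\) is the sigmoid function, and \(\beta\) controls how far the trained policy may drift from the reference.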

📝 Abstract
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.
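To make point (2) of the abstract concrete, a multimodal preference example typically bundles a data source, candidate model responses, and a preference annotation. The sketch below is illustrative only; the field names (image_path, chosen, rejected, annotator) are hypothetical and not taken from the paper or any specific dataset.

# Minimal sketch of one multimodal preference record; real alignment
# datasets vary in schema and may cover video or audio sources as well.
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    image_path: str            # data source: the image (or clip) the prompt refers to
    prompt: str                # instruction or question posed to the MLLM
    chosen: str                # model response that annotators preferred
    rejected: str              # model response that annotators rejected
    annotator: str = "human"   # preference annotation source: "human", "model", or "rule"

example = PreferenceExample(
    image_path="images/0001.jpg",
    prompt="Describe what is happening in this image.",
    chosen="Two children are playing soccer on a grass field.",
    rejected="A group of adults is swimming in a pool.",  # hallucinated content
)

Collections of such records are what the surveyed alignment algorithms (RLHF reward modeling, DPO, and related methods) consume as training signal.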
Problem

Research questions and friction points this paper is trying to address.

Addressing truthfulness and safety in multimodal LLMs.
Aligning multimodal LLMs with human preferences effectively.
Reviewing alignment algorithms for diverse multimodal applications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reviews alignment algorithms for multimodal LLMs.
Explores application scenarios and alignment dataset construction.
Evaluates benchmarks and discusses future alignment directions.