Visual Large Language Models for Generalized and Specialized Applications

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visual large language models (VLLMs) suffer from ill-defined application boundaries, inconsistent evaluation protocols, and lagging ethical governance across multimodal domains—including images, videos, depth maps, embodied actions, and natural language. Method: We propose a cross-modal, holistic taxonomy of VLLMs—the first of its kind—grounded in a “general capability–domain adaptation” co-evolution paradigm. Our framework integrates multimodal alignment, instruction tuning, chain-of-thought reasoning, embodied interaction modeling, and trustworthy AI analysis, spanning the full stack from training to deployment and evaluation. Contribution/Results: We design an extensible evaluation and governance framework, construct a structured VLLM application knowledge graph, and release an authoritative open-source paper repository (GitHub: awesome-VLLMs). This work establishes unified benchmarks and practical guidelines for both academic research and industrial deployment of VLLMs.

📝 Abstract
Visual-language models (VLMs) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing these contents, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available at: https://github.com/JackYFL/awesome-VLLMs.
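The "unified embedding space" the abstract refers to can be illustrated with a minimal sketch, in the style of contrastive vision-language pretraining (e.g. CLIP-like alignment). All names, dimensions, and projection weights below are hypothetical illustrations, not the paper's method: image and text features are projected into a shared space and compared by cosine similarity, with matched pairs expected on the diagonal after training.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(features, weights):
    """Linearly project modality-specific features into the shared
    embedding space and L2-normalize, so dot products are cosines."""
    z = features @ weights
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy setup: 4 image-text pairs with modality-specific feature sizes.
image_feats = rng.normal(size=(4, 512))  # e.g. from a vision encoder
text_feats = rng.normal(size=(4, 256))   # e.g. from a text encoder
W_img = rng.normal(size=(512, 128))      # image projection head (untrained)
W_txt = rng.normal(size=(256, 128))      # text projection head (untrained)

img_emb = project(image_feats, W_img)
txt_emb = project(text_feats, W_txt)

# Pairwise cosine similarities between all image and text embeddings;
# contrastive training would push the diagonal entries toward 1.
sim = img_emb @ txt_emb.T                # shape (4, 4), values in [-1, 1]
print(sim.shape)
```

With untrained random projections the diagonal is not yet dominant; contrastive training of the projection heads is what aligns matched pairs.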
Problem

Research questions and friction points this paper is trying to address.

Visual Large Language Models
Application Scope
Future Development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Large Language Models
Multimodal Processing
Future Research Directions