REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LVLM evaluation benchmarks suffer from narrow coverage, failing to systematically assess model reliability (e.g., factual consistency, adversarial robustness) and value alignment (e.g., ethics, safety, privacy). To address this, we propose REVAL, the first dual-axis comprehensive evaluation framework for LVLMs, accompanied by a large-scale VQA benchmark of 144K samples spanning hallucination, adversarial vulnerability, bias, toxicity, and privacy leakage. Our methodology introduces multi-granularity question construction, adversarial image perturbations, moral scenario reasoning, and privacy-sensitivity testing, coupled with a standardized scoring system. Extensive evaluation across 26 state-of-the-art LVLMs reveals strong performance in perception and toxicity mitigation, yet significant deficiencies persist in adversarial robustness, privacy preservation, and ethical reasoning, highlighting critical gaps in current LVLM development.
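The adversarial image perturbations mentioned in the summary can be sketched as an FGSM-style attack. The sketch below is illustrative only: the toy linear scorer, the loss, and the epsilon value are assumptions for demonstration, not REVAL's actual robustness pipeline.

```python
import numpy as np

def fgsm_perturb(image, grad, epsilon=0.03):
    """FGSM-style step: shift each pixel by epsilon in the direction that
    increases the loss, then clip back to the valid [0, 1] pixel range."""
    adv = image + epsilon * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # toy 8x8 grayscale image in [0, 1]
w = rng.standard_normal((8, 8))     # hypothetical linear scorer (stand-in model)

# For loss = sum(w * x), the gradient of the loss w.r.t. the input is w itself.
grad = w
adv_image = fgsm_perturb(image, grad, epsilon=0.03)

# The perturbation stays bounded: no pixel moves by more than epsilon.
print(np.abs(adv_image - image).max() <= 0.03 + 1e-9)
```

In a real robustness probe, the gradient would come from back-propagating the LVLM's loss through the vision encoder; the bounded-perturbation property shown at the end is what makes such attacks hard for humans to notice.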

📝 Abstract
The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs' strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the REliability and VALue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (e.g., perceptual accuracy and hallucination tendencies) and robustness (e.g., resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (e.g., bias and moral understanding), safety issues (e.g., toxicity and jailbreak vulnerabilities), and privacy problems (e.g., privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models such as GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvement, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.
Problem

Research questions and friction points this paper is trying to address.

Assessing reliability and values of Large Vision-Language Models comprehensively
Evaluating LVLMs on truthfulness, robustness, ethics, safety, and privacy
Identifying vulnerabilities in adversarial scenarios and ethical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for LVLM evaluation
144K VQA samples for reliability and values
Assesses 26 models including GPT-4o
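The "standardized scoring system" over 144K VQA samples can be sketched as per-dimension score aggregation. The dimension names below follow the paper's axes, but the scoring rule (exact-match accuracy) and the sample format are assumptions for illustration; the benchmark's actual metrics may differ.

```python
from collections import defaultdict

def score_benchmark(samples):
    """samples: iterable of (dimension, model_answer, gold_answer) triples.
    Returns a dict mapping each dimension to its accuracy in [0, 1]."""
    hits, totals = defaultdict(int), defaultdict(int)
    for dim, pred, gold in samples:
        totals[dim] += 1
        # Hypothetical scoring rule: case-insensitive exact match.
        hits[dim] += int(pred.strip().lower() == gold.strip().lower())
    return {dim: hits[dim] / totals[dim] for dim in totals}

# Toy demo: a handful of invented samples across three dimensions.
demo = [
    ("truthfulness", "a cat", "a cat"),
    ("truthfulness", "a dog", "a cat"),
    ("robustness", "refuse", "refuse"),
    ("privacy", "I cannot share that", "refuse"),
]
print(score_benchmark(demo))
# {'truthfulness': 0.5, 'robustness': 1.0, 'privacy': 0.0}
```

Per-dimension aggregation like this is what lets a benchmark report that a model is strong on perception yet weak on privacy, rather than collapsing everything into a single misleading average.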
👥 Authors

Jie Zhang
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China

Zheng Yuan
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China

Zhongqi Wang
Institute of Computing Technology, Chinese Academy of Sciences

Bei Yan
Northeastern University

Sibo Wang
The Chinese University of Hong Kong

Xiangkui Cao
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China

Zonghui Guo
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China

Shiguang Shan
Institute of Computing Technology, Chinese Academy of Sciences

Xilin Chen
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China