INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

📅 2024-06-13
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
The insurance domain lacks a systematic evaluation benchmark for Large Vision-Language Models (LVLMs), hindering their practical deployment. Method: We introduce INS-MMBench, the first comprehensive insurance-specific multimodal benchmark, covering four representative types of insurance (auto, property, health, and agricultural) and comprising 2.2K multiple-choice questions organized into 12 meta-tasks and 22 fundamental tasks. We systematically review and formalize the multimodal task taxonomy for insurance and propose a hierarchical evaluation framework. The benchmark supports unified evaluation of both closed-source (e.g., GPT-4o) and open-source (e.g., BLIP-2) LVLMs. Contribution/Results: Evaluation of multiple representative LVLMs validates the benchmark and reveals notable weaknesses in insurance-specific multimodal capabilities. The dataset and evaluation code are publicly released.

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for LVLMs in insurance domain
Underexplored potential of LVLMs in insurance applications
No systematic review of insurance-related multimodal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops INS-MMBench for insurance LVLM evaluation
Hierarchical benchmark spanning 12 meta-tasks and 22 fundamental tasks
Evaluates 11 LVLMs including GPT-4o and LLaVA
👥 Authors
Chenwei Lin, Fudan University
Hanjia Lyu, University of Rochester
Xian Xu, Fudan University
Jiebo Luo, University of Rochester