MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-turn vision-language dialogue benchmarks offer insufficient coverage, hindering comprehensive evaluation of large models' integrated capabilities in complex interactive scenarios. To address this, we introduce MultiVerse, a highly challenging benchmark designed specifically for multi-turn vision-language dialogue, comprising 647 dialogues (averaging four turns each) and 484 diverse tasks spanning factual knowledge, mathematical reasoning, programming, and more. We propose a checklist-based automated evaluation framework that fuses multiple source benchmarks and uses GPT-4o as the evaluator, enabling fine-grained scoring across 37 aspects, including perceptual accuracy, linguistic clarity, and factual correctness. Experiments show that even state-of-the-art models such as GPT-4o achieve only a ~50% task success rate across multi-turn interactions, demonstrating MultiVerse's strong discriminative power and utility for rigorous model assessment.

📝 Abstract
Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, providing full dialogue context significantly improves the performance of smaller or weaker models, underscoring the importance of in-context learning. We believe MultiVerse offers a comprehensive landscape for evaluating the multi-turn interaction abilities of VLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-turn conversational abilities of vision-language models
Addressing limitations in existing multi-turn dialogue benchmarks
Assessing performance across perceptual, linguistic, and reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn benchmark with 647 diverse dialogues
Checklist-based evaluation using GPT-4o
Measures 37 aspects including perceptual accuracy
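The checklist-based evaluation above can be illustrated with a minimal sketch. In the paper's framework, GPT-4o produces per-aspect verdicts over 37 dimensions; the snippet below shows only a hypothetical aggregation step (the `ChecklistItem` type, `score_dialogue` function, and the all-items-must-pass success rule are assumptions for illustration, not the paper's actual scoring formula).

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One judged aspect of a dialogue turn (verdict assumed to come from an LLM judge)."""
    aspect: str   # e.g. "perceptual accuracy"
    passed: bool  # judge's binary verdict for this aspect


def score_dialogue(items, pass_threshold=1.0):
    """Aggregate per-aspect verdicts into (pass fraction, task success flag).

    A dialogue counts as successful only if the fraction of satisfied
    checklist items meets the threshold (default: every item must pass).
    """
    if not items:
        raise ValueError("empty checklist")
    frac = sum(item.passed for item in items) / len(items)
    return frac, frac >= pass_threshold


# Example: a 3-item checklist where one aspect fails.
items = [
    ChecklistItem("perceptual accuracy", True),
    ChecklistItem("linguistic clarity", True),
    ChecklistItem("factual correctness", False),
]
frac, success = score_dialogue(items)  # frac ≈ 0.67, success is False
```

Under an all-pass rule like this, a single failed aspect sinks the whole dialogue, which is one plausible reason even strong models land near a 50% success rate on a 37-aspect checklist.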