GenCeption: Evaluate Vision LLMs with Unlabeled Unimodal Data

📅 2024-02-22
🏛️ Computer Speech & Language
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current evaluation of multimodal large language models (MLLMs) relies on annotated multimodal benchmarks that are costly and noise-prone, saturate quickly, and cover emerging hallucination phenomena poorly. To address this, we propose GenCeption, the first zero-shot evaluation framework that requires neither annotations nor aligned image-text pairs. It needs only unlabeled unimodal data (e.g., images or text) and constructs implicit evaluation benchmarks from generative self-supervised signals. Its core components are generative consistency modeling, implicit reference generation, cross-modal semantic alignment distillation, and unsupervised confidence calibration. Evaluated across multiple vision-language LLMs, GenCeption correlates strongly with human judgments (Spearman ρ > 0.89) and significantly outperforms conventional zero-shot methods such as CLIP and BLIP. It advances scalability, fidelity, and hallucination sensitivity in MLLM assessment at the same time.

Problem

Research questions and friction points this paper is trying to address.

Evaluates Vision LLMs without annotated multimodal data.
Measures inter-modality semantic coherence and hallucination tendency.
Introduces MMECeption benchmark for Vision LLM performance assessment.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-free evaluation using unimodal data
Iterative describe-then-generate steps that expose semantic drift across modalities
GC@T metric quantifies semantic coherence and hallucination tendency over T iterations
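
The iterative procedure and a GC@T-style score can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `describe`, `generate`, and `embed` are hypothetical placeholders for a vision LLM, an image generator, and an image encoder, and the linear weighting by iteration index in `coherence_score` is an assumption, since this page does not state the exact formula.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_iterations(
    seed_image: Any,
    describe: Callable[[Any], Any],   # hypothetical: vision LLM, image -> description
    generate: Callable[[Any], Any],   # hypothetical: image generator, description -> image
    embed: Callable[[Any], Vector],   # hypothetical: image encoder, image -> vector
    steps: int,
) -> List[float]:
    """Alternate describe/generate for `steps` rounds and record, after each
    round, how similar the generated image is to the seed image."""
    seed_vec = embed(seed_image)
    current = seed_image
    similarities: List[float] = []
    for _ in range(steps):
        description = describe(current)
        current = generate(description)
        similarities.append(cosine_similarity(seed_vec, embed(current)))
    return similarities


def coherence_score(similarities: Sequence[float]) -> float:
    """Weighted average of per-iteration similarities with weights 1..T.

    Assumption: later iterations count more, so that accumulated semantic
    drift is penalised. The actual GC@T weighting may differ.
    """
    if not similarities:
        raise ValueError("need at least one iteration")
    total_weight = sum(range(1, len(similarities) + 1))
    weighted = sum(t * s for t, s in enumerate(similarities, start=1))
    return weighted / total_weight
```

With real models plugged in, a score near 1 would suggest that the descriptions preserve the seed image's content over T rounds, while a falling score would point to drift or hallucinated details. No annotations are needed at any step, which matches the annotation-free claim above.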