GenCeption: Evaluate Vision LLMs with Unlabeled Unimodal Data

📅 2024-02-22

🏛️ Computer Speech & Language

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Evaluation of multimodal large language models (MLLMs) currently relies on annotated multimodal benchmarks that are costly to build and prone to label noise; these benchmarks also saturate quickly and cover emerging hallucination phenomena poorly. To address this, we propose GenCeption, the first zero-shot evaluation framework that requires neither annotations nor aligned image-text pairs. It needs only unlabeled unimodal data (e.g., images or text) and constructs implicit evaluation benchmarks from generative self-supervised signals. Its core components are generative consistency modeling, implicit reference generation, cross-modal semantic alignment distillation, and unsupervised confidence calibration. Evaluated across multiple vision-language LLMs, GenCeption achieves strong correlation with human judgments (Spearman ρ > 0.89) and significantly outperforms conventional zero-shot methods such as CLIP and BLIP. It thereby improves the scalability, fidelity, and hallucination sensitivity of MLLM assessment.
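
For readers unfamiliar with the agreement statistic quoted in the summary, Spearman's ρ is the Pearson correlation computed on ranks rather than raw values. The following is a minimal sketch using only the Python standard library. The function names (`average_ranks`, `spearman_rho`) and the example numbers are hypothetical, chosen for illustration; they are not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: automatic scores for five models versus a human
# preference ordering (higher is better in both). Illustrative values only.
metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47]
human_ranking = [3, 1, 5, 2, 4]
rho = spearman_rho(metric_scores, human_ranking)
```

Because only the ordering matters, ρ is insensitive to the scale of the automatic metric, which is why it is a common choice for comparing an evaluation score against human judgments.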