🤖 AI Summary
Current evaluation of multimodal large language models (MLLMs) relies on annotated multimodal benchmarks that are costly to build and prone to noise; these benchmarks saturate quickly and cover emerging hallucination phenomena poorly. To address this, we propose GenCeption, the first zero-shot evaluation framework that requires neither annotations nor aligned image-text pairs. It needs only unlabeled unimodal data (e.g., images or text) and constructs implicit evaluation benchmarks from generative self-supervised signals. Its core components are generative consistency modeling, implicit reference generation, cross-modal semantic alignment distillation, and unsupervised confidence calibration. Evaluated across multiple vision-language LLMs, GenCeption correlates strongly with human judgments (Spearman ρ > 0.89) and significantly outperforms conventional zero-shot methods such as CLIP and BLIP. It thereby improves the scalability, fidelity, and hallucination sensitivity of MLLM assessment.
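
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how a Spearman rank correlation between a metric's scores and human ratings can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman_rho` are illustrative, and the score lists are hypothetical values made up for the example, not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five models (illustrative only, not from the paper).
metric_scores = [0.71, 0.64, 0.58, 0.49, 0.52]
human_scores = [4.5, 4.1, 3.2, 3.0, 2.9]
rho = spearman_rho(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on rank order, it measures whether the metric orders models the same way human raters do, regardless of the scale of either score.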