🤖 AI Summary
This work addresses the challenge of automated evaluation and alignment for large multimodal models (LMMs). To this end, we introduce LLaVA-Critic, the first open-source LMM designed as a generalist evaluator, capable of unified scoring and preference judgments across diverse tasks and evaluation criteria (e.g., factual accuracy, relevance, visual consistency). Following the standard LLaVA design of a vision encoder coupled to a large language model, LLaVA-Critic is fine-tuned on a high-quality critic instruction-following dataset spanning diverse evaluation criteria and scenarios. Experiments demonstrate its effectiveness in two roles: as an LMM-as-a-Judge, it performs on par with or surpasses GPT models on multiple evaluation benchmarks; and as a source of reward signals for preference learning, it improves model alignment. Overall, this work highlights the potential of open-source LMMs for self-critique and points toward scalable, superhuman alignment feedback mechanisms.
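To make the LMM-as-a-Judge usage concrete, here is a minimal pointwise-scoring sketch. It assumes a LLaVA-OneVision-style checkpoint loadable through Hugging Face transformers; the `lmms-lab/llava-critic-7b` model id and the evaluation prompt wording are assumptions, not the paper's verbatim template, and the released weights may instead require the authors' LLaVA-NeXT codebase.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed checkpoint id; LLaVA-Critic builds on LLaVA-OneVision, so any
# OneVision-compatible critic checkpoint would load the same way.
MODEL_ID = "lmms-lab/llava-critic-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Pointwise judging: ask the critic to score one model response for a given
# image + question. The scoring instruction below is illustrative only.
question = "What is unusual about this image?"
response = "A man is ironing clothes on the back of a moving taxi."
critic_prompt = (
    "You are shown an image, a question, and a model's response.\n"
    f"Question: {question}\nResponse: {response}\n"
    "Rate the response from 1 to 10 for accuracy and helpfulness, "
    "then briefly justify your score."
)

conversation = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": critic_prompt}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```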
📝 Abstract
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
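For the preference-learning use, one common recipe (assumed here; the abstract only states that the critic "generates reward signals for preference learning") is to sample several candidate responses per image-prompt pair, have the critic pick the better response in each pairing, and train on the resulting chosen/rejected pairs with DPO. In the sketch below, `sample_responses` and `critic_prefers` are hypothetical helpers wrapping the policy model and the critic.

```python
from dataclasses import dataclass
from itertools import combinations
import random

@dataclass
class PreferencePair:
    image: str   # image path or identifier
    prompt: str
    chosen: str
    rejected: str

def build_dpo_pairs(dataset, sample_responses, critic_prefers, k=4):
    """For each (image, prompt), sample k candidate responses from the
    policy model and keep pairs where the critic's preference is consistent
    under both presentation orders (to reduce position bias)."""
    pairs = []
    for image, prompt in dataset:
        responses = sample_responses(image, prompt, k)  # k candidates
        for a, b in combinations(responses, 2):
            a_over_b = critic_prefers(image, prompt, a, b)  # True if a > b
            b_over_a = critic_prefers(image, prompt, b, a)  # True if b > a
            if a_over_b and not b_over_a:
                pairs.append(PreferencePair(image, prompt, chosen=a, rejected=b))
            elif b_over_a and not a_over_b:
                pairs.append(PreferencePair(image, prompt, chosen=b, rejected=a))
    random.shuffle(pairs)
    return pairs
```

Querying the critic in both orders and discarding inconsistent verdicts is a standard debiasing choice for LMM judges, not something the abstract specifies.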