Flex-Judge: Think Once, Judge Anywhere

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM-as-a-Judge suffers from weak generalization and high annotation costs in cross-modal evaluation. Method: We propose the "reasoning-as-universal-representation" paradigm, which enables zero-shot cross-modal transfer from only minimal textual reasoning data. The approach introduces a lightweight multimodal fusion architecture that jointly integrates instruction tuning and chain-of-thought distillation, augmented by contrastive learning to enforce consistency across reasoning paths. Contribution/Results: By treating structured textual reasoning as a modality-agnostic universal representation carrier, the method substantially improves generalization across heterogeneous modalities, including images, videos, and molecular structures, especially under annotation scarcity. Experiments demonstrate state-of-the-art performance on eight cross-modal benchmarks, outperforming commercial APIs such as GPT-4V and Claude-3.5 while using only one-tenth of the textual training data conventionally required.
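The summary mentions contrastive learning to enforce consistency across reasoning paths. The paper's actual loss is not reproduced here; the following is a minimal, hypothetical InfoNCE-style sketch (function names and the toy 2-D embeddings are assumptions) of how such a consistency objective might look: reasoning paths that reach the same verdict are pulled together, and paths that reach conflicting verdicts are pushed apart.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def consistency_loss(anchor, positive, negatives, temperature=0.1):
    # Hypothetical InfoNCE-style contrastive loss: "anchor" and "positive"
    # are embeddings of reasoning paths agreeing on a judgment; "negatives"
    # are embeddings of paths that reached a different verdict.
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In this sketch, aligned reasoning paths drive the loss toward zero, while a path agreeing with a conflicting verdict incurs a large penalty, which is the standard behavior of temperature-scaled contrastive objectives.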

📝 Abstract
Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
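The abstract's core intuition is that a single reasoning-based rubric can judge any modality once the sample is rendered as text. The paper's actual prompt format is not shown here; the sketch below (function names, template wording, and the 1-5 verdict scale are all assumptions) illustrates what such a modality-agnostic judge prompt and verdict parser might look like.

```python
def build_judge_prompt(modality: str, sample_text: str,
                       response: str, criterion: str) -> str:
    # Hypothetical template: the sample arrives as a textual rendering
    # (image caption, video transcript, SMILES string, ...), so one
    # reasoning-first rubric serves every modality.
    return (
        f"[{modality}] Input (textual rendering): {sample_text}\n"
        f"Model response: {response}\n"
        f"Criterion: {criterion}\n"
        "First write step-by-step reasoning, then end with a line "
        "'Verdict: <score 1-5>'."
    )

def parse_verdict(judge_output: str) -> int:
    # Extract the 1-5 score from the judge's final verdict line.
    for line in reversed(judge_output.strip().splitlines()):
        if line.startswith("Verdict:"):
            return int(line.split(":")[1].strip())
    raise ValueError("no verdict line found")
```

Separating free-form reasoning from a machine-parsable verdict line is a common LLM-as-a-Judge design choice: the reasoning carries the transferable decision pattern, while the final line keeps scoring automatable.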
Problem

Research questions and friction points this paper is trying to address.

Reducing manual annotation costs with LLM-as-a-Judge
Improving generalization across diverse multimodal tasks
Enhancing evaluation in resource-constrained domains like molecules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages textual reasoning for multimodal judgments
Minimizes training-data needs by transferring generalizable reasoning patterns
Outperforms commercial APIs with fewer resources