AI Summary
This work addresses the lack of efficient, scalable, and human-aligned evaluation methods for multimodal text-to-audio-video generation. It presents the first systematic exploration of using omni-modal large language models (omni-LLMs) as unified evaluators, leveraging chain-of-thought prompting to elicit their multimodal understanding and reasoning capabilities for automatically assessing semantic alignment and cross-modal consistency between generated audio/video and input text. Experimental results demonstrate that the proposed approach achieves correlation with human judgments comparable to that of conventional metrics across nine perceptual and alignment benchmarks, while outperforming them on semantically dense tasks such as audio-text alignment, video-text alignment, and trimodal consistency. Furthermore, the method provides interpretable feedback to guide generation refinement, highlighting both the strengths of omni-LLMs in semantic evaluation and their limitations in temporal resolution.
Abstract
State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Motivated by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation with human judgments comparable to that of traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
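To make the evaluation setup concrete, the sketch below illustrates how an omni-LLM could be prompted as a judge for one of the alignment axes (audio-video-text coherence). This is a minimal illustration, not the paper's actual protocol: the prompt wording, the 1-5 rubric, the JSON output schema, and the `call_omni_llm` function are all assumptions, and the model call is stubbed out since real omni-LLM APIs differ.

```python
# Illustrative sketch of an omni-LLM-as-judge call for audio-video-text coherence.
# The rubric, prompt text, and call_omni_llm stub are hypothetical placeholders.
import json

JUDGE_PROMPT = """You are evaluating a text-conditioned audio-video generation.
Generator prompt: "{prompt}"
You are given the generated video frames and the audio track.

Think step by step:
1. Describe the key events you see and hear.
2. Check whether the audio, the video, and the prompt describe the same events.
3. Note any semantic or physical inconsistencies.

Finally, output JSON: {{"score": <1-5 coherence rating>, "rationale": "<one sentence>"}}"""


def call_omni_llm(prompt_text: str, video_path: str, audio_path: str) -> str:
    """Placeholder for a real omni-LLM request that accepts interleaved
    text, video, and audio inputs. Returns a canned response here."""
    return ('{"score": 4, "rationale": "The barking dog is visible and audible, '
            'but the thunder mentioned in the prompt is missing."}')


def judge_coherence(gen_prompt: str, video_path: str, audio_path: str):
    # Build the chain-of-thought judging prompt and parse the structured verdict.
    raw = call_omni_llm(JUDGE_PROMPT.format(prompt=gen_prompt), video_path, audio_path)
    verdict = json.loads(raw)
    return verdict["score"], verdict["rationale"]


if __name__ == "__main__":
    score, rationale = judge_coherence(
        "A dog barks as thunder rolls overhead", "sample.mp4", "sample.wav")
    print(score, rationale)
```

The rationale string is what makes this style of evaluation useful beyond a scalar score: it can be fed back to the generator (or to a human) to localize which part of the prompt the output failed to realize, which is the feedback-based refinement use case mentioned above.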