AI Summary
This work addresses the lack of efficient, scalable, and human-aligned evaluation methods for multimodal text-to-audio-video generation. It presents the first systematic exploration of using omni-modal large language models (omni-LLMs) as unified evaluators, leveraging chain-of-thought prompting to elicit their multimodal understanding and reasoning capabilities for automatically assessing semantic alignment and cross-modal consistency between generated audio/video and input text. Experimental results demonstrate that the proposed approach achieves correlation with human judgments comparable to that of conventional metrics across nine perceptual and alignment benchmarks, while outperforming them on semantically dense tasks such as audio-text alignment, video-text alignment, and trimodal consistency. Furthermore, the method provides interpretable feedback to guide generation refinement, highlighting both the strengths of omni-LLMs in semantic evaluation and their limitations in temporal resolution.
Abstract
State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Motivated by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation with human judgments comparable to that of traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
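To make the evaluation setup concrete, the sketch below illustrates how an omni-LLM could be prompted as a judge for one of the alignment axes (audio-video-text coherence). This is a minimal illustration, not the paper's actual protocol: the prompt wording, the 1-5 rubric, the JSON output schema, and the `call_omni_llm` function are all assumptions, and the model call is stubbed out since real omni-LLM APIs differ.

```python
# Illustrative sketch of an omni-LLM-as-judge call for audio-video-text coherence.
# The rubric, prompt text, and call_omni_llm stub are hypothetical placeholders.
import json

JUDGE_PROMPT = """You are evaluating a text-conditioned audio-video generation.
Generator prompt: "{prompt}"
You are given the generated video frames and the audio track.

Think step by step:
1. Describe the key events you see and hear.
2. Check whether the audio, the video, and the prompt describe the same events.
3. Note any semantic or physical inconsistencies.

Finally, output JSON: {{"score": <1-5 coherence rating>, "rationale": "<one sentence>"}}"""


def call_omni_llm(prompt_text: str, video_path: str, audio_path: str) -> str:
    """Placeholder for a real omni-LLM request that accepts interleaved
    text, video, and audio inputs. Returns a canned response here."""
    return ('{"score": 4, "rationale": "The barking dog is visible and audible, '
            'but the thunder mentioned in the prompt is missing."}')


def judge_coherence(gen_prompt: str, video_path: str, audio_path: str):
    # Build the chain-of-thought judging prompt and parse the structured verdict.
    raw = call_omni_llm(JUDGE_PROMPT.format(prompt=gen_prompt), video_path, audio_path)
    verdict = json.loads(raw)
    return verdict["score"], verdict["rationale"]


if __name__ == "__main__":
    score, rationale = judge_coherence(
        "A dog barks as thunder rolls overhead", "sample.mp4", "sample.wav")
    print(score, rationale)
```

The rationale string is what makes this style of evaluation useful beyond a scalar score: it can be fed back to the generator (or to a human) to localize which part of the prompt the output failed to realize, which is the feedback-based refinement use case mentioned above.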