Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-based QA datasets inadequately support complex queries that require visual demonstration. Method: We introduce RealVideoQuest, the first video-generation evaluation benchmark grounded in real user intent, comprising 7.5K user queries that explicitly demand video responses and 4.5K high-quality query-video pairs. The approach systematically aligns user queries with video response intent through a multi-stage pipeline that combines video retrieval with human refinement and cross-modal intent parsing, paired with a hybrid automatic-human evaluation framework assessing relevance, temporal plausibility, and semantic fidelity. Results: Experiments reveal pervasive intent misinterpretation and visual-expression inaccuracies in current text-to-video (T2V) models. RealVideoQuest provides a reproducible benchmark and concrete diagnostic insights, establishing a foundation for advancing visually grounded question answering.

📝 Abstract
Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user queries, pointing to key challenges and future research opportunities in multimodal AI.
Problem

Research questions and friction points this paper is trying to address.

Evaluating text-to-video models for real-world visual queries
Addressing lack of datasets for video-based query responses
Assessing multimodal AI challenges in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs RealVideoQuest benchmark for T2V evaluation
Multi-stage video retrieval and refinement process
Multi-angle evaluation system for video quality
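The multi-angle evaluation system above scores generated video answers along relevance, temporal plausibility, and semantic fidelity. A minimal sketch of how such per-angle scores might be aggregated is shown below; the axis names come from the summary, but the dataclass, the weights, and the 0-1 score scale are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class VideoAnswerScores:
    """Hypothetical per-angle scores for one generated video answer (0-1 scale)."""
    relevance: float             # does the video address the query intent?
    temporal_plausibility: float # do events unfold coherently over time?
    semantic_fidelity: float     # does the content match the query semantics?

def aggregate(scores: VideoAnswerScores,
              weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted average over the three evaluation angles.

    The weights are placeholder assumptions; a real system would calibrate
    them (and likely mix in human judgments) per the hybrid framework.
    """
    parts = (scores.relevance, scores.temporal_plausibility,
             scores.semantic_fidelity)
    return sum(w * s for w, s in zip(weights, parts))

if __name__ == "__main__":
    s = VideoAnswerScores(relevance=0.9, temporal_plausibility=0.6,
                          semantic_fidelity=0.8)
    # 0.4*0.9 + 0.3*0.6 + 0.3*0.8 = 0.78
    print(round(aggregate(s), 3))
```

In practice the automatic scores would come from cross-modal models and the human scores from annotators, with the hybrid framework reconciling the two; this sketch only shows the final aggregation step.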