SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for multimodal large language models rarely evaluate frame-level retrieval and reasoning that depend on domain-specific knowledge in short-video scenarios, particularly in gaming, where frames are visually ambiguous and the required knowledge is long-tailed and fast-evolving. This work introduces SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. It pairs a structured dataset with a frozen offline retrieval environment that exposes text, image, and multimodal retrieval interfaces, and it supports paradigms ranging from direct question answering to agent-driven retrieval. In the evaluation, the best open-source direct-QA model reaches 66.4% accuracy, the best practical agent reaches 79.1%, and oracle knowledge reaches 95.4%, which points to bottlenecks in tool invocation and evidence integration.
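The three reported accuracies are easier to compare as percentage-point gaps. The short calculation below uses only the figures quoted in the summary; the variable names are illustrative and do not come from the paper.

```python
# Accuracies (in %) as quoted in the summary.
direct_qa_acc = 66.4   # best open-source direct-QA model
agent_acc = 79.1       # best practical agent
oracle_acc = 95.4      # answering with oracle knowledge

# Gain from adding agentic search on top of model-only answering.
agent_gain = round(agent_acc - direct_qa_acc, 1)

# Headroom that remains between the best agent and oracle knowledge.
oracle_headroom = round(oracle_acc - agent_acc, 1)

print(f"agent gain over direct QA: {agent_gain} points")
print(f"remaining headroom to oracle: {oracle_headroom} points")
```

Agentic search adds 12.7 points over direct QA, yet 16.3 points of headroom remain before oracle-level accuracy. In other words, the remaining gap to oracle knowledge is larger than the gain that practical search delivers.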

📝 Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflows to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
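The evaluation setup described in the abstract (four-choice questions scored against a frozen offline retrieval environment) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Example` structure, the `text_search` function, the toy corpus, and the fictional game content are hypothetical and are not SVFSearch's actual data format or API. Only the text interface is mimicked; the image and multimodal interfaces are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Example:
    """Hypothetical four-choice item; not the benchmark's real schema."""
    question: str
    choices: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter


# Stand-in for a frozen offline corpus: a fixed list, no live web access.
# The game and its facts are invented for this sketch.
CORPUS: List[str] = [
    "Shadow Keep is the final dungeon of the fictional game Ember Quest.",
    "The Frost Bow is crafted at the Glacier Forge in Ember Quest.",
]


def text_search(query: str, k: int = 1) -> List[str]:
    """Toy keyword-overlap retriever over the frozen corpus."""
    query_tokens = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(query_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def direct_qa(example: Example) -> str:
    """Placeholder for model-only answering: always guesses option A."""
    return "A"


def retrieval_qa(example: Example) -> str:
    """Retrieve once, then pick the first option found in the evidence."""
    evidence = " ".join(text_search(example.question)).lower()
    for letter, text in example.choices.items():
        if text.lower() in evidence:
            return letter
    return "A"  # fall back to a guess when no option is supported


def accuracy(solver: Callable[[Example], str], examples: List[Example]) -> float:
    """Fraction of examples where the solver returns the gold letter."""
    correct = sum(solver(ex) == ex.answer for ex in examples)
    return correct / len(examples)


OPTIONS = {"A": "Shadow Keep", "B": "Glacier Forge",
           "C": "Ember Tower", "D": "Sunken Dock"}
EXAMPLES = [
    Example("Where is the Frost Bow crafted?", OPTIONS, "B"),
    Example("Which dungeon is the final one in Ember Quest?", OPTIONS, "A"),
]

print("direct QA accuracy:", accuracy(direct_qa, EXAMPLES))
print("retrieval QA accuracy:", accuracy(retrieval_qa, EXAMPLES))
```

Because the corpus and retriever are fixed, repeated runs score every solver against identical evidence. That is the reproducibility property the abstract attributes to avoiding uncontrolled web search APIs.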

Problem

Research questions and friction points this paper is trying to address.

short-video frame search

multimodal knowledge-intensive reasoning

gaming domain

visual ambiguity

vertical domain knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval

short-video frame search

knowledge-intensive benchmark