🤖 AI Summary
Current video-language models (Video-LMs) show significant limitations in fine-grained spatiotemporal reasoning, specifically in aligning action semantics with visual and temporal evidence. To address this, we introduce Know-Show: the first unified evaluation benchmark spanning five scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions, built on Charades, Action Genome, and Ego4D and comprising 2.5K human-annotated questions. Our evaluation reveals substantial performance gaps between state-of-the-art models (Qwen-VL, VideoLLaVA, GPT-4o, Gemini) and human annotators, especially on hand-object interaction tasks. Methodologically, we propose GRAM, a lightweight, training-free plug-in that leverages attention-based video token selection and explicit timestamp encoding to jointly enable reasoning and temporal localization. GRAM enhances model interpretability (aligning "what is known" with "what is seen") and significantly improves spatiotemporal grounding without architectural modification or retraining.
📝 Abstract
Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (including Qwen-VL, VideoLLaVA, GPT-4o, and Gemini) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.
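To make the two mechanisms attributed to GRAM concrete, here is a minimal sketch of attention-based video token selection paired with explicit timestamps. This is an illustrative toy, not GRAM's actual implementation: the function name, feature shapes, and the use of NumPy with one feature vector per frame are all assumptions for the example.

```python
import numpy as np

def select_grounded_tokens(frame_feats, query_feat, timestamps, k=4):
    """Score each frame token against a query, keep the top-k, and
    return them together with their timestamps (explicit time encoding).

    frame_feats: (T, D) per-frame visual features   [assumed layout]
    query_feat:  (D,) text/query embedding          [assumed layout]
    timestamps:  (T,) time in seconds for each frame
    """
    # Attention-style relevance: scaled dot product, softmax over frames
    scores = frame_feats @ query_feat / np.sqrt(frame_feats.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Keep the k most relevant frame tokens, restored to temporal order,
    # so downstream reasoning can cite *when* the evidence occurs
    top = np.sort(np.argsort(weights)[-k:])
    return frame_feats[top], timestamps[top]

# Toy usage: 8 frames with one-hot features; the query is built so that
# frames 3 and 4 are the relevant visual evidence
feats = np.eye(8, 16)
query = feats[3] + feats[4]
sel, ts = select_grounded_tokens(feats, query, np.arange(8) * 0.5, k=2)
print(ts)  # timestamps of the two most query-relevant frames: [1.5 2.]
```

The selected features could then be passed to a frozen Video-LM alongside their timestamps, which is what makes the approach training-free in spirit: no weights are updated, only the visual context fed to the model changes.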