Video-LLMs with Temporal Visual Screening

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Video-LLMs suffer from sparse frame sampling and insufficient temporal supervision, limiting their ability to model fine-grained temporal semantics or to emulate the human-like, progress-bar-style localization of key segments. To address this, the authors propose Temporal Visual Screening (TVS), a cognition-inspired task and modular front-end adapter that jointly retains focus-critical video segments and reconstructs queries into their most direct form while preserving answer consistency. TVS integrates into both video instruction tuning (training) and video question answering (inference) pipelines. The authors curate the first dedicated TVS benchmark and propose ReSimplifyIt, a baseline that outperforms prior approaches on video trimming by 0.47 in F1 score. Incorporating TVS yields relative gains of 7.33% at training time and 34.6% at inference time.

📝 Abstract
Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, and inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) maintaining invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load: during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
Problem

Research questions and friction points this paper is trying to address.

Video-LLMs struggle to understand fine-grained temporal semantics
Current models lack effective temporal visual screening capabilities
Queries need to be aligned with focus-critical video segments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Visual Screening (TVS) as a pre-processing task for video QA and instruction-tuning data
Retains focus-critical segments while simplifying queries and preserving answer consistency
Modular front-end adapter for both training and inference pipelines
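The screen-then-simplify idea can be illustrated with a minimal sketch. Everything below is hypothetical, not the paper's implementation: frame content is stood in for by per-frame tag sets (a real adapter would use visual embeddings), `FILLER` is an illustrative stop-word list, and the function names are invented.

```python
# Hypothetical sketch of a TVS-style front-end adapter: pick the frame window
# most relevant to the query (segment retention), and strip filler tokens from
# the query (query simplification). Illustrative only, not the paper's method.

FILLER = {"please", "could", "you", "tell", "me", "the", "a", "an",
          "in", "this", "video", "do"}

def simplify_query(query: str) -> str:
    """Reduce a query to its most direct form by dropping filler tokens."""
    tokens = [t for t in query.lower().strip("?").split() if t not in FILLER]
    return " ".join(tokens)

def screen_segment(frame_tags, query, window=4):
    """Return (start, end) of the fixed-length window whose frame tags
    overlap most with the simplified query.

    frame_tags: list of per-frame tag sets (stand-ins for visual features).
    """
    q_tokens = set(simplify_query(query).split())
    best_start, best_score = 0, -1
    for start in range(max(1, len(frame_tags) - window + 1)):
        # Score a window by total tag overlap with the query tokens.
        score = sum(len(q_tokens & tags) for tags in frame_tags[start:start + window])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, min(best_start + window, len(frame_tags))
```

For example, with frames tagged `[{"car"}, {"car"}, {"dog", "run"}, {"dog", "jump"}, {"tree"}, {"tree"}]` and the query "Could you tell me what the dog does in this video?", the adapter keeps the dog segment and hands the downstream Video-LLM the simplified query "what dog does".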