TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

📅 2025-12-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two critical challenges in video temporal grounding (VTG): the scarcity of effective recipes for optimizing multimodal large language models (MLLMs) and severe benchmark distortion. The authors propose a systematic solution: (1) constructing TimeLens-Bench, a rigorously re-annotated, high-fidelity evaluation benchmark, and TimeLens-100K, a large-scale, high-quality training dataset; (2) introducing interleaved textual encoding for time representation and a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm, together with carefully designed RLVR recipes; and (3) exposing substantial distortions in mainstream benchmarks, which leads to dramatic re-rankings of model performance. The resulting TimeLens models achieve state-of-the-art VTG performance among open-source models and surpass proprietary counterparts including GPT-5 and Gemini-2.5-Flash. All code, data, and models are open-sourced.
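The "verifiable rewards" in RLVR work for temporal grounding because a predicted time span can be checked programmatically against ground truth, with no learned reward model. A minimal sketch, assuming a temporal-IoU-style reward over (start, end) intervals in seconds (the paper's exact reward design may differ):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between a predicted and a ground-truth (start, end) interval."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def vtg_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Verifiable reward: computed directly from the model's answer,
    so no reward model (and no chain-of-thought) is required."""
    # Dense variant: use the IoU itself; a binary variant would
    # threshold it, e.g. float(temporal_iou(pred, gt) >= 0.5).
    return temporal_iou(pred, gt)
```

Because the reward is a pure function of the answer, it plugs directly into standard RLVR-style policy optimization loops.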

📝 Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable evaluation standards in existing video temporal grounding (VTG) benchmarks
Mitigates noisy training data for multimodal large language models on VTG
Identifies effective algorithmic design recipes for video temporal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TimeLens-Bench with re-annotated benchmarks
Creates TimeLens-100K via automated re-annotation pipeline
Proposes thinking-free RLVR training with interleaved textual time encoding
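"Interleaved textual encoding for time representation" means placing explicit timestamp text alongside frame tokens, so the model can ground its answers to absolute times rather than frame indices. A hypothetical sketch of building such an interleaved prompt (the `<frame_i>` placeholder names and formatting are illustrative assumptions, not taken from the paper):

```python
def interleave_time_tokens(num_frames: int, fps: float) -> str:
    """Build a prompt that interleaves each frame placeholder with a
    plain-text timestamp, so the LLM sees absolute times inline."""
    parts = []
    for i in range(num_frames):
        t = i / fps  # absolute time of frame i in seconds
        parts.append(f"{t:.1f}s: <frame_{i}>")
    return " ".join(parts)
```

With timestamps visible as ordinary text, the model can answer grounding queries by copying times it has literally read, rather than inferring them from frame positions.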