🤖 AI Summary
This work addresses two critical challenges in video temporal grounding (VTG): the scarcity of effective recipes for optimizing multimodal large language models (MLLMs) for the task, and severe distortion in existing benchmarks. The proposed solution is systematic: (1) constructing TimeLens-Bench—a rigorously re-annotated, high-fidelity evaluation benchmark—and TimeLens-100K—a large-scale, high-quality training dataset; (2) introducing an interleaved temporal-text encoding mechanism and a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm, combined with multi-stage instruction tuning and knowledge distillation; and (3) exposing substantial distortions in mainstream benchmarks, enabling an accurate re-ranking of model performance. The resulting TimeLens series achieves state-of-the-art VTG performance among open-source models and surpasses closed-source counterparts such as GPT-5 and Gemini-2.5-Flash. All code, data, and models are fully open-sourced.
📝 Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
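To make the "interleaved textual encoding for time representation" idea concrete, here is a minimal, hypothetical sketch: each sampled frame's placeholder is preceded by its timestamp rendered as plain text, so the language model can ground span predictions to explicit times. The function name, timestamp format, and `<frame>` token are illustrative assumptions, not taken from the paper.

```python
def build_interleaved_prompt(timestamps, frame_token="<frame>"):
    """Interleave textual timestamps with visual frame placeholders.

    timestamps: list of frame times in seconds (floats).
    frame_token: placeholder later replaced by the frame's visual tokens
                 (hypothetical token name, for illustration only).
    """
    parts = []
    for t in timestamps:
        parts.append(f"[{t:.1f}s]")  # timestamp encoded as ordinary text
        parts.append(frame_token)    # slot for the frame's visual tokens
    return " ".join(parts)

# Frames sampled every 2 seconds from a short clip:
prompt = build_interleaved_prompt([0.0, 2.0, 4.0])
# prompt == "[0.0s] <frame> [2.0s] <frame> [4.0s] <frame>"
```

The design intuition is that expressing time as in-context text lets the model answer grounding queries (e.g. "from 2.0s to 4.0s") in the same textual space it already generates, rather than requiring dedicated temporal embeddings.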