TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

📅 2025-08-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video large language models (Video-LLMs) employ a uniform static processing path for all tokens in video temporal grounding (VTG), overlooking fundamental differences among its three subtasks: temporal localization, saliency assessment, and textual generation. To address this limitation, we propose an expert-guided dynamic-routing Video-LLM built upon a Mixture-of-Experts (MoE) architecture. Our method introduces a task-aware token routing mechanism that dynamically dispatches semantically heterogeneous tokens to specialized expert modules. This enables fine-grained event modeling and simultaneous generation of precise temporal boundaries, saliency scores, and natural language descriptions in the output. Evaluated on three core VTG tasks—dense video captioning, moment retrieval, and video highlight detection—our model achieves state-of-the-art performance, significantly outperforming prior approaches while improving computational efficiency.

📝 Abstract

Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks take the form of event sequences, with each event defined by precise timestamps, a saliency score, and a textual description. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical, static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation are fundamentally distinct tasks that require specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, while also improving computational efficiency. These design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
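To make the output format described in the abstract concrete, the sketch below models one event as a timestamp pair, a saliency score, and a caption. It is only an illustration: the class name `GroundedEvent`, its field names, the use of seconds, and the text layout produced by `format_events` are all assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedEvent:
    """One VTG output event (hypothetical container, not taken from the paper)."""
    start_sec: float  # event start time; seconds are an assumed unit
    end_sec: float    # event end time
    saliency: float   # relevance score; the value range is not specified in the source
    caption: str      # textual description of the event

    def __post_init__(self) -> None:
        # Reject segments that end before they start.
        if self.end_sec < self.start_sec:
            raise ValueError("end_sec must not precede start_sec")


def format_events(events: List[GroundedEvent]) -> str:
    """Serialize events in order of start time, one event per line."""
    ordered = sorted(events, key=lambda e: e.start_sec)
    return "\n".join(
        f"[{e.start_sec:.1f}s - {e.end_sec:.1f}s] "
        f"saliency={e.saliency:.2f} {e.caption}"
        for e in ordered
    )


if __name__ == "__main__":
    demo = [
        GroundedEvent(12.0, 18.5, 0.9, "a person pours the batter"),
        GroundedEvent(0.0, 5.0, 0.4, "a person cracks two eggs"),
    ]
    print(format_events(demo))
```

Keeping the three fields separate in one record mirrors the point the abstract makes: each event carries three different kinds of information, which is what motivates handling them with different experts.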

Problem

Research questions and friction points this paper is trying to address.

Precisely identify video segments from text queries

Dynamically route task-specific tokens to specialized experts

Improve event modeling in diverse VTG applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts for dynamic token routing

Specialized experts for distinct VTG subtasks

Improved efficiency and precision in event modeling
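The routing idea in the points above can be illustrated with a generic top-1 softmax gate, a common Mixture-of-Experts pattern. This page does not describe TimeExpert's actual router, so everything here is an assumption made for illustration: the function names `route_top1` and `dispatch`, the toy three-dimensional token vectors, and the hand-set router weights.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token_vec: List[float],
               router_weights: List[List[float]]) -> Tuple[int, float]:
    """Pick one expert for a token.

    router_weights holds one weight vector per expert; an expert's logit is
    the dot product of its weight vector with the token vector. Returns the
    index of the highest-probability expert and its gate probability.
    """
    logits = [sum(w * x for w, x in zip(weights, token_vec))
              for weights in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def dispatch(tokens: List[List[float]],
             router_weights: List[List[float]]) -> Dict[int, List[int]]:
    """Group token positions by the expert selected for each token."""
    assignment: Dict[int, List[int]] = {i: [] for i in range(len(router_weights))}
    for position, vec in enumerate(tokens):
        expert, _ = route_top1(vec, router_weights)
        assignment[expert].append(position)
    return assignment


if __name__ == "__main__":
    # Toy embeddings: axis 0 ~ timestamp-like, 1 ~ saliency-like, 2 ~ text-like.
    tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    # Hand-set identity router, so each axis maps to its own expert.
    router = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(dispatch(tokens, router))
```

In a trained model the router weights would be learned rather than set by hand, and each expert would be a feed-forward block that processes only the tokens assigned to it. Because each token activates a single expert in this sketch, per-token compute does not grow with the number of experts, which is the usual efficiency argument for sparse MoE layers.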

🔎 Similar Papers

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models