DisTime: Distribution-based Time Representation for Video Large Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-LLMs suffer from coarse-grained temporal localization due to discrete time representations and scarce high-quality temporal annotations. To address this, we propose DisTime, a distribution-based temporal modeling framework: (1) a learnable time token that spans a continuous temporal embedding space, paired with a Distribution-based Time Decoder and Encoder; (2) an automated temporal annotation pipeline yielding InternVid-TG, a large-scale temporally grounded dataset with 1.25M annotated events across 179k videos; and (3) cross-model collaborative annotation that leverages both Video-LLMs and specialized temporal grounding models to improve annotation fidelity. The approach achieves state-of-the-art performance on three time-sensitive tasks (temporal grounding, temporal question answering, and video referring) while maintaining competitive accuracy on general video QA. Code and data are publicly released.

📝 Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, 55 times the scale of ActivityNet-Caption. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
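To make the decoding idea concrete, here is a minimal PyTorch-style sketch of a distribution-based time decoder. It follows the behavior described in the abstract (a learnable time token decoded into temporal probability distributions that yield continuous timestamps), but every concrete detail is an assumption: the class name `DistributionTimeDecoder`, the bin count, and the expectation-over-bin-centers readout are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DistributionTimeDecoder(nn.Module):
    """Sketch of a distribution-based time decoder (all names illustrative).

    Projects the hidden state of a learnable time token onto logits over
    `num_bins` positions in normalized time [0, 1], softmaxes them into a
    probability distribution per boundary, and reads out each timestamp as
    the expectation over bin centers.
    """

    def __init__(self, hidden_dim: int = 4096, num_bins: int = 100):
        super().__init__()
        self.start_head = nn.Linear(hidden_dim, num_bins)  # start-boundary logits
        self.end_head = nn.Linear(hidden_dim, num_bins)    # end-boundary logits
        # Fixed bin centers spanning normalized time [0, 1].
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, time_token: torch.Tensor) -> torch.Tensor:
        # time_token: (batch, hidden_dim) hidden state of the time token.
        p_start = self.start_head(time_token).softmax(dim=-1)  # (batch, num_bins)
        p_end = self.end_head(time_token).softmax(dim=-1)
        # Expectation over bin centers yields a continuous timestamp and keeps
        # boundary ambiguity soft instead of forcing a single discrete bin.
        start = (p_start * self.bin_centers).sum(dim=-1)
        end = (p_end * self.bin_centers).sum(dim=-1)
        return torch.stack([start, end], dim=-1)  # normalized (start, end)
```

Scaling the normalized (start, end) pair by the clip duration recovers timestamps in seconds, and the underlying distributions give a soft notion of boundary uncertainty rather than a single hard cut.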
Problem

Research questions and friction points this paper is trying to address.

Enhance temporal comprehension in Video-LLMs
Address boundary ambiguities in time representation
Overcome temporal granularity limitations in datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous temporal embedding space via learnable token
Distribution-based Time Decoder that outputs temporal probability distributions
Automated cross-model annotation pipeline for finer temporal granularity (see the sketch after this list)
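The annotation pipeline is only described at a high level above, so here is one plausible shape for it in Python. The `captioner` and `grounder` objects and their methods (`describe_events`, `localize`) are hypothetical stand-ins for a Video-LLM and a dedicated temporal grounding model; none of these names come from the DisTime codebase.

```python
from typing import Dict, List


def annotate_video(video, captioner, grounder,
                   score_threshold: float = 0.5) -> List[Dict]:
    """Hypothetical cross-model annotation pipeline (illustrative only)."""
    # Step 1: a Video-LLM proposes free-form event captions for the video.
    captions = captioner.describe_events(video)

    # Step 2: a specialist grounding model localizes each caption to a
    # (start, end) span with a confidence score; weak matches are dropped.
    events = []
    for caption in captions:
        start, end, score = grounder.localize(video, caption)
        if score >= score_threshold:
            events.append({"caption": caption, "start": start, "end": end})
    return events
```

Run over a large unlabeled corpus, a loop of this kind is how captioning strength and localization expertise could be combined into dense event-level annotations such as those in InternVid-TG.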
👥 Authors
Yingsen Zeng, Meituan Inc.
Zepeng Huang, Meituan Inc.
Yujie Zhong, Meituan Inc. (Computer Vision)
Chengjian Feng, Meituan (Computer Vision, Object Detection)
Jie Hu, Meituan Inc.
Lin Ma, Meituan Inc.
Yang Liu, Wangxuan Institute of Computer Technology, Peking University