Chrono: A Simple Blueprint for Representing Time in MLLMs

📅 2024-06-26
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses weak temporal modeling and contextual understanding in video-language models, focusing on the video moment localization task. The authors propose Chrono, a lightweight, general-purpose sequence design that eliminates dedicated temporal modules and auxiliary input signals, instead achieving unified, implicit temporal modeling solely through extended positional encodings. Chrono supports both zero-shot transfer and fine-tuning, requires no video transcripts or video-specific backbones, and integrates seamlessly with mainstream image-text pretrained multimodal large language models (MLLMs). Evaluated on four major benchmarks (Charades-STA, QVHighlights, ActivityNet Captions, and NeXT-GQA), Chrono sets new state-of-the-art results across all four, significantly improving moment retrieval and grounding-based video question answering. Notably, it is the first approach to demonstrate strong generalization across diverse architectures, training paradigms, and datasets.
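To make the idea concrete, the sketch below illustrates what a "sequence blueprint" for time might look like: each frame's visual tokens are interleaved with its timestamp rendered as plain text, so an image-text MLLM can ground a query to a start/end moment without any dedicated temporal module. This is a hypothetical illustration, not the paper's exact prompt format; the `<frame_i>` placeholders, the timestamp rendering, and the `build_chrono_prompt` helper are all assumptions.

```python
def build_chrono_prompt(frame_tokens, timestamps, query):
    """Interleave frame placeholders with their timestamps (seconds) and
    append a moment-retrieval question, all as one plain text sequence.
    NOTE: hypothetical format, not taken from the paper."""
    parts = [f"{t:.2f}s: {tok}" for tok, t in zip(frame_tokens, timestamps)]
    parts.append(
        f'Question: when does "{query}" happen? '
        "Answer with start and end times in seconds."
    )
    return "\n".join(parts)

# Three sampled frames at 0.0s, 1.5s, and 3.0s, plus a localization query.
prompt = build_chrono_prompt(
    ["<frame_0>", "<frame_1>", "<frame_2>"],
    [0.0, 1.5, 3.0],
    "a person opens the door",
)
print(prompt)
```

The appeal of such a design is that time enters through the input sequence itself, so any image-text pretrained MLLM can consume it without architectural changes.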

📝 Abstract
The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first as image-text Multimodal LLMs (MLLMs) and then as video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models through the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to better encode contextual and temporal information. Interestingly, we find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. Through extensive ablations across different MLLM architectures, fine-tuning and zero-shot settings, and different datasets, we achieve a new SOTA in moment retrieval on the most widely used benchmarks, Charades-STA, QVHighlights, and ActivityNet Captions, as well as in grounded video question answering on NeXT-GQA.
Problem

Research questions and friction points this paper is trying to address.

Video-language models show weak contextual and temporal comprehension, exposed by the temporal localization task.
Prior solutions rely on complex task-specific architectures, dedicated time-embedding modules, or auxiliary signals such as video transcripts.
A simple, general recipe for representing time in image-text pretrained MLLMs has been missing.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chrono simplifies temporal comprehension, replacing dedicated temporal modules and auxiliary inputs with a plain sequence design
Universal sequence blueprint that applies to image-text pretrained MLLMs in both fine-tuned and zero-shot settings
New SOTA in moment retrieval (Charades-STA, QVHighlights, ActivityNet Captions) and grounded video QA (NeXT-GQA)