WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large language models lack temporal awareness in streaming scenarios, often failing to distinguish between current and historical frames, which leads to temporal confusion. This work proposes WeaveTime, a framework that explicitly diagnoses and addresses this issue for the first time. By introducing a temporal reconstruction objective during fine-tuning, the model learns ordered temporal representations, while an uncertainty-triggered coarse-to-fine historical caching mechanism dynamically retrieves relevant information during inference. Notably, WeaveTime requires no architectural modifications, offering a lightweight, model-agnostic solution that enables efficient streaming video understanding under causal constraints. Experiments demonstrate that WeaveTime significantly improves accuracy and reduces latency across multiple streaming video understanding benchmarks.
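The temporal reconstruction objective described above can be pictured as a frame-ordering task: shuffle the frames of a clip and train the model to recover their chronological order. A minimal sketch of that data-preparation step (hypothetical; the paper's exact formulation and loss are not given here, and `temporal_reconstruction_batch` / `order_accuracy` are illustrative names):

```python
import random

def temporal_reconstruction_batch(frames):
    """Given chronologically ordered frame features, produce a shuffled
    input and the permutation that restores the original order.
    Hypothetical sketch of a frame-order reconstruction objective."""
    n = len(frames)
    perm = list(range(n))
    random.shuffle(perm)
    shuffled = [frames[i] for i in perm]
    # target[j] = original index of shuffled frame j; a model would be
    # trained (e.g. with cross-entropy) to predict this ordering.
    return shuffled, perm

def order_accuracy(pred, target):
    """Fraction of positions where the predicted original index is correct."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)
```

A perfect prediction recovers the permutation exactly, so `order_accuracy(target, target)` is `1.0`; the score degrades gracefully as the model confuses nearby frames.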

📝 Abstract
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal fine-tuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
Problem

Research questions and friction points this paper is trying to address.

Video-LLMs
streaming video
temporal order
time-awareness
causal sequence
Innovation

Methods, ideas, or system contributions that make the work stand out.

WeaveTime
Temporal Reconstruction
Streaming Video Understanding
Time-Aware Video-LLM
Dynamic Focus Cache
Yulin Zhang
ShanghaiTech University
Cheng Shi
School of Computing and Data Science, The University of Hong Kong
Sibei Yang
Associate Professor, School of Computer Science and Engineering, Sun Yat-Sen University