PEARL: Personalized Streaming Video Understanding Model

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing personalized multimodal approaches struggle to support real-time, continuous video understanding and interaction. It introduces Personalized Streaming Video Understanding (PSVU) as a novel task and presents PEARL-Bench, the first dedicated benchmark for PSVU, featuring both frame-level and video-level evaluations with high-quality, timestamped annotations generated through an automated pipeline and verified by humans. The authors further propose PEARL, a plug-and-play, training-free strategy that leverages vision-language models for zero-shot personalized inference. Extensive experiments show that PEARL achieves state-of-the-art results, with consistent and significant PSVU gains across eight models spanning three mainstream architectures.

📝 Abstract
Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnect between continuous visual input and instant real-world feedback limits their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) Video-level, a novel mode focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across eight offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to three distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
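To make the two evaluation modes concrete, here is a minimal sketch of how timestamped PSVU annotations could be represented and queried during streaming. This is a hypothetical illustration only: the class and field names (`PSVUAnnotation`, `start_ts`, `end_ts`, etc.) are assumptions for exposition and are not PEARL-Bench's actual schema. The key idea it shows is that a frame-level annotation targets one discrete timestamp, while a video-level annotation spans an interval of continuous frames, and a streaming evaluator must surface whichever annotations are active at the current time.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PSVUAnnotation:
    """One personalized QA item tied to a time span in a video.

    Field names are illustrative, not the benchmark's real schema.
    """
    video_id: str
    concept: str      # personalized concept token, e.g. "<bo>"
    question: str
    answer: str
    start_ts: float   # seconds into the stream
    end_ts: float     # frame-level items collapse to start_ts == end_ts

    @property
    def is_frame_level(self) -> bool:
        # Frame-level: a single discrete timestamp; video-level: an interval.
        return self.start_ts == self.end_ts

def active_at(annotations: List[PSVUAnnotation], t: float) -> List[PSVUAnnotation]:
    """Return the annotations whose time span covers streaming time t."""
    return [a for a in annotations if a.start_ts <= t <= a.end_ts]

anns = [
    # Frame-level: identify the personalized concept in one frame.
    PSVUAnnotation("v01", "<bo>", "Who just entered?", "<bo>", 12.0, 12.0),
    # Video-level: a personalized action unfolding across continuous frames.
    PSVUAnnotation("v01", "<bo>", "What is <bo> doing?", "waving", 10.0, 18.0),
]
print([a.is_frame_level for a in active_at(anns, 12.0)])  # → [True, False]
```

At t = 12.0 both items are active, so a streaming model would be expected to answer both at that exact moment; before t = 10.0 neither applies, which is what distinguishes this setting from offline video QA.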
Problem

Research questions and friction points this paper is trying to address.

Personalized Streaming Video Understanding
real-time personalization
vision-language models
streaming video
interactive AI assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Streaming Video Understanding
PEARL-Bench
training-free personalization
video-level understanding
vision-language models