HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in content-aware video streaming: the high cost of manual annotation and the limited generalization of existing visual saliency models for dynamic importance weighting. To overcome these limitations, the authors propose a framework that, for the first time, integrates large language models (LLMs) into video saliency modeling. The framework comprises three modules: a local context-aware module for frame-level understanding, an LLM-guided merge-sort mechanism that enforces globally consistent importance ranking, and a multimodal temporal prediction model enabling real-time inference without future information. Evaluated in both on-demand and live streaming scenarios, the method improves weight prediction accuracy by up to 11.5% and 26%, respectively, while a real-world user study demonstrates a 14.7% increase in correlation with quality of experience (QoE).
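The LLM-guided merge-sort mechanism mentioned above can be pictured as an ordinary merge sort in which every comparison is delegated to the LLM, so chunks rated in different local windows still end up in one globally consistent order. The sketch below is a minimal illustration, not HiVid's implementation: `llm_prefers` is a hypothetical stand-in for the actual LLM comparison call, and here it just compares provisional local scores.

```python
def llm_prefers(score_a, score_b):
    """Placeholder comparator: True if chunk A is judged more important.
    A real system would prompt an LLM with both chunks' descriptions;
    here we compare dummy local scores for illustration only."""
    return score_a >= score_b


def llm_merge_sort(chunks):
    """Merge sort where each comparison is delegated to the 'LLM',
    yielding a globally consistent ranking (most important first)
    in O(n log n) comparisons."""
    if len(chunks) <= 1:
        return list(chunks)
    mid = len(chunks) // 2
    left = llm_merge_sort(chunks[:mid])
    right = llm_merge_sort(chunks[mid:])
    merged, i, j = [], 0, 0
    # Merge step: each pairwise decision is one (placeholder) LLM query.
    while i < len(left) and j < len(right):
        if llm_prefers(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Because merge sort only ever compares two items at a time, each LLM query stays within token limits regardless of video length.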

Technology Category

Application Category

📝 Abstract
Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive, while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address three non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module that assesses frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD, where ratings are inconsistent across local windows, we propose a ranking module that performs global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming, which requires low-latency online inference without future knowledge, we propose a prediction module that forecasts future weights with a multi-modal time-series model comprising content-aware attention and an adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. A real-world user study validates that HiVid boosts streaming QoE correlation by 14.7%.
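The perception module's local-window idea can be sketched as a simple autoregressive loop: score one small window of frames at a time while carrying forward a running summary, so no single LLM call exceeds token limits. This is a minimal sketch under stated assumptions — `score_window` is a hypothetical placeholder for HiVid's actual LLM call, and the toy scoring heuristic inside it is invented purely for illustration.

```python
def score_window(summary, window):
    """Placeholder for an LLM call that returns per-frame importance
    scores in [0, 1] plus an updated running summary. The heuristic
    below is a stand-in, not the paper's method."""
    scores = [len(frame) % 5 / 4 for frame in window]  # toy scores
    new_summary = summary + " | " + window[-1]  # carry context forward
    return scores, new_summary


def perceive(frames, window_size=4):
    """Autoregressive local-window scoring: the number of (placeholder)
    LLM calls grows with n / window_size, not with total token count,
    and each call sees only one window plus a compact summary."""
    summary, weights = "", []
    for start in range(0, len(frames), window_size):
        window = frames[start:start + window_size]
        scores, summary = score_window(summary, window)
        weights.extend(scores)
    return weights
```

The running summary is what lets each window be judged in the context of everything seen so far, which is why the paper describes the understanding as "coherent" despite the chunked processing.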
Problem

Research questions and friction points this paper is trying to address.

content-aware streaming
video saliency
quality of experience (QoE)
importance weighting
Large Language Models (LLMs)
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided video saliency
content-aware streaming
perception module
global re-ranking
multi-modal time series prediction
👥 Authors
Jiahui Chen — Tsinghua University
Bo Peng — Tsinghua University (AI for science; Mg porous implants; architected metamaterials design)
Lianchen Jia — Tsinghua University
Zeyu Zhang — The Australian National University
Tianchi Huang — Sony (adaptive video streaming; reinforcement learning; communication with ML)
Lifeng Sun — Tsinghua University; Key Laboratory of Pervasive Computing, Ministry of Education