COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
To address the prohibitively high GPU resource consumption of deploying multimodal large language models (MLLMs) online on short-video platforms, this paper proposes a cascaded framework that pairs lightweight pre-screening with fine-grained MLLM judgment. The authors design an inference paradigm fusing vision, text, and audio signals: a lightweight first-stage model rapidly filters incoming samples, and only critical instances trigger the computationally intensive MLLM for fine-grained quality understanding. All modules are jointly optimized end-to-end to ensure cross-modal feature alignment and decision-level synergy. Evaluated on the TikTok video management platform across two real-world quality understanding tasks, the framework achieves accuracy comparable to full MLLM inference (within ±0.3% absolute accuracy) while reducing GPU memory footprint by 62% and cutting per-request inference cost by 57%.

📝 Abstract
Recently, with the emergence of Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, however, deploying MLLMs online demands substantial GPU resources. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework for better video quality understanding on TikTok. To this end, we first propose an MLLM fusing all visual, textual, and audio signals, and then develop a cascade framework with a lightweight model as the pre-filtering stage and the MLLM as the fine-consideration stage, significantly reducing the need for GPU resources while retaining the performance demonstrated by the MLLM alone. To demonstrate the effectiveness of COEF-VQ, we deployed this new framework onto the video management platform (VMP) at TikTok and performed a series of detailed experiments on two in-house tasks related to video quality understanding. We show that COEF-VQ leads to substantial performance gains with limited resource consumption on these two tasks.
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU resource demands for MLLM video quality analysis
Enhancing video quality understanding with efficient cascaded framework
Balancing computational efficiency and classification performance in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded MLLM framework for video quality
Entropy-based pre-filtering to reduce GPU usage
Lightweight model prioritizes high-uncertainty samples
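The entropy-based routing idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `threshold`, the stub models, and the function names are hypothetical, and the paper's actual gating may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_predict(sample, light_model, mllm, threshold=0.5):
    """Route a sample through a two-stage cascade.

    The lightweight model answers directly when it is confident
    (low entropy); high-uncertainty samples are escalated to the
    expensive MLLM. `threshold` is a hypothetical tuning knob.
    Returns (predicted_class, stage_used).
    """
    probs = light_model(sample)  # stage 1: cheap class probabilities
    if entropy(probs) < threshold:
        # Confident pre-filter decision: skip the MLLM entirely.
        return max(range(len(probs)), key=probs.__getitem__), "light"
    # Uncertain: trigger the fine-consideration MLLM stage.
    return mllm(sample), "mllm"
```

Because most traffic is clear-cut and resolved by the first stage, only the small high-entropy fraction incurs MLLM inference cost, which is how the cascade trades a little accuracy headroom for large GPU savings.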