Inference Compute-Optimal Video Vision Language Models

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses inference-stage resource allocation for video vision-language models (VLMs) under fixed computational budgets, jointly optimizing three scalable dimensions: language model size, number of video frames, and visual tokens per frame. Method: Leveraging large-scale hyperparameter sweeps, parametric performance modeling, and multi-dimensional constrained scaling analysis, we systematically characterize the compute–accuracy Pareto frontier for video VLM inference and uncover the dynamic influence of dataset scale on optimal configurations. We further propose a generalizable methodology for constructing optimal frontier curves and establish principled, deployment-oriented guidelines for selecting scaling factors. Contribution/Results: Our approach yields significant accuracy gains in video understanding under identical compute constraints and delivers quantifiable, transferable configuration recommendations. It is the first to formalize and empirically analyze the inference trade-offs in video VLMs across these three critical axes, enabling principled, data-aware architecture design for real-world applications.
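As a concrete illustration of the constrained scaling analysis described above, the sketch below enumerates (language model size, frame count, visual tokens per frame) configurations, scores each with an assumed inference-cost proxy and a placeholder accuracy function, and extracts the compute–accuracy Pareto frontier. The cost model, the `mock_accuracy` curve, and all constants are illustrative assumptions, not the paper's fitted values.

```python
import math
from itertools import product

def inference_flops(params_b, frames, tokens_per_frame):
    """Rough cost proxy: ~2 * params * sequence_length FLOPs per forward pass."""
    seq_len = frames * tokens_per_frame
    return 2 * params_b * 1e9 * seq_len

def mock_accuracy(params_b, frames, tokens_per_frame):
    """Placeholder saturating score; stands in for benchmarked task accuracy."""
    return (1 - math.exp(-0.2 * params_b)) \
        * (1 - math.exp(-0.05 * frames)) \
        * (1 - math.exp(-0.01 * tokens_per_frame))

def pareto_frontier(configs):
    """Keep configs not dominated in (lower cost, higher accuracy)."""
    # Sort by cost, breaking ties in favor of higher accuracy, then keep
    # only configs that strictly improve accuracy over all cheaper ones.
    pts = sorted(configs, key=lambda c: (c["flops"], -c["acc"]))
    frontier, best_acc = [], float("-inf")
    for c in pts:
        if c["acc"] > best_acc:
            frontier.append(c)
            best_acc = c["acc"]
    return frontier

# Hypothetical sweep grid over the three scaling axes.
configs = [
    {"params_b": p, "frames": f, "tokens": t,
     "flops": inference_flops(p, f, t),
     "acc": mock_accuracy(p, f, t)}
    for p, f, t in product([0.5, 2, 7], [8, 16, 32], [16, 64, 144])
]
frontier = pareto_frontier(configs)
```

Given a fixed compute budget, the optimal configuration is then the frontier point with the largest cost not exceeding the budget; repeating this for each finetuning data scale is how one would observe the frontier shifting with data size.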

📝 Abstract
This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
Problem

Research questions and friction points this paper is trying to address.

Optimal allocation of inference compute in video vision language models
Identify the optimal model configuration under fixed inference compute budgets
Understand how task performance depends on scaling factors and finetuning data size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal allocation of inference compute budgets
Large-scale training sweeps for model configuration
Parametric modeling of task performance
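The last bullet, parametric modeling of task performance, can be sketched as fitting a saturating power law score(C) ≈ a − b·C^(−α) to sweep results. This functional form is a common assumption in scaling-law analyses and stands in here for whatever parametric family the paper actually fits; the sweep data below are synthetic.

```python
import numpy as np

def fit_power_law(compute, score, alphas=np.linspace(0.05, 1.0, 96)):
    """Fit score(C) = a - b * C**(-alpha) by grid-searching alpha and
    solving the remaining linear least-squares subproblem for (a, b)."""
    best = None
    for alpha in alphas:
        # For fixed alpha the model is linear in (a, b).
        X = np.stack([np.ones_like(compute), -compute ** (-alpha)], axis=1)
        coef, _, _, _ = np.linalg.lstsq(X, score, rcond=None)
        resid = score - X @ coef
        sse = float(resid @ resid)
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], alpha)
    _, a, b, alpha = best
    return a, b, alpha

# Synthetic sweep: true curve a=0.8, b=0.5, alpha=0.3 plus small noise.
rng = np.random.default_rng(0)
C = np.logspace(1, 4, 20)
y = 0.8 - 0.5 * C ** (-0.3) + rng.normal(0, 0.002, C.shape)
a, b, alpha = fit_power_law(C, y)
```

Once such a curve is fit per configuration axis, comparing the fitted curves at a given compute budget yields the kind of transferable configuration recommendations the summary describes.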