Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

📅 2025-09-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
VideoLLMs face a dual bottleneck when the input frame count is increased to capture fine-grained temporal detail: computational cost grows prohibitively and performance degrades over long contexts. To address this, the authors propose Video Parallel Scaling (VPS), a training-free, plug-and-play inference-time parallelization method. VPS partitions video frames into complementary subsets, processes each subset in an independent inference stream, and aggregates the output probabilities, expanding the model's perceptual bandwidth without enlarging its context window. A theoretical analysis shows that aggregating uncorrelated visual evidence effectively contracts the Chinchilla scaling law, yielding an information gain over a single pass. Evaluated on benchmarks including Video-MME and EventHallusion, VPS consistently outperforms baselines such as self-consistency across 2B–32B model scales while remaining memory-efficient and scalable.

📝 Abstract
Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
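The mechanism described in the abstract — disjoint frame subsets, independent streams, probability aggregation — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the round-robin partitioning, the uniform averaging, and the function names (`partition_frames`, `vps_aggregate`) are assumptions for the sketch; the paper may use a different partitioning scheme or a weighted aggregation.

```python
import numpy as np

def partition_frames(num_frames, num_streams):
    """Split frame indices into disjoint, complementary subsets via
    round-robin assignment (one plausible partitioning scheme)."""
    return [list(range(s, num_frames, num_streams)) for s in range(num_streams)]

def vps_aggregate(stream_logits):
    """Aggregate per-stream answer distributions by converting each
    stream's logits to probabilities (softmax) and averaging them
    (a simple uniform aggregation, assumed for illustration)."""
    probs = []
    for logits in stream_logits:
        z = np.exp(logits - logits.max())  # stable softmax
        probs.append(z / z.sum())
    return np.mean(probs, axis=0)

# Each stream would run the VideoLLM on its own frame subset;
# here, stand-in logits illustrate the aggregation step.
subsets = partition_frames(num_frames=8, num_streams=2)
final_probs = vps_aggregate([np.array([2.0, 0.0]), np.array([0.0, 2.0])])
```

Because the subsets are disjoint, each stream sees frames the others miss, so the averaged distribution reflects more visual evidence than any single context window — without any stream's input growing longer.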
Problem

Research questions and friction points this paper is trying to address.

Increasing the input frame count incurs prohibitive computational costs
Long context lengths degrade VideoLLM performance
Fine-grained temporal understanding requires richer visual evidence than one context window admits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel inference streams process disjoint, complementary frame subsets
Output probabilities aggregated across streams to combine visual evidence
Training-free, memory-efficient framework that improves temporal reasoning
🔎 Similar Papers