🤖 AI Summary
Existing video multimodal large language models (MLLMs) suffer from poor robustness, limited accuracy, and inflexible use of computational resources at test time, especially in lightweight variants. To address these limitations, we propose the first cybernetics-inspired adaptive inference framework: a sensor–controller–reasoning closed loop for frozen video MLLMs that enables online self-monitoring, feedback-triggered correction, staged refinement, and lightweight resource scheduling, all without retraining. The framework is model-agnostic and compatible with diverse frozen MLLMs. Empirically, it achieves substantial gains on VideoMMMU: +8.3% for Qwen2.5-VL-7B, +10.0% for Qwen2.5-VL-72B, and +5.5% for InternVL3-8B, surpassing GPT-4o. Overall performance approaches human-expert level, with consistent improvements maintained across the VideoMME and WorldSense benchmarks.
📝 Abstract
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations can be more severe for models with fewer parameters. To address them, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors the forward process of the MLLM and collects intermediate signals, such as attention drift; the controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks such as VideoMME and WorldSense, highlighting its effectiveness and generalization in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
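The sensor–controller loop described above can be sketched in miniature. This is a hypothetical illustration, not CyberV's actual implementation: `mllm_infer`, the drift proxy in `sensor`, the feedback string, and the thresholds are all stand-ins for the paper's components (the frozen MLLM forward pass, attention-drift monitoring, and controller-generated guidance).

```python
# Minimal sketch of a sensor–controller–MLLM closed loop (illustrative only).
# All functions and thresholds here are placeholders, not the CyberV API.

def mllm_infer(question, feedback=None):
    """Stand-in for a forward pass of a frozen video MLLM.

    Returns an (answer, confidence) pair; in this toy version,
    controller feedback simply raises the confidence proxy.
    """
    confidence = 0.4 if feedback is None else 0.9
    return f"answer({question})", confidence

def sensor(confidence, drift_threshold=0.5):
    """Monitor an intermediate signal (here, a crude attention-drift
    proxy via confidence) and flag unreliable rounds."""
    return confidence < drift_threshold  # True -> trigger correction

def controller(question, max_rounds=3):
    """Run inference rounds, injecting feedback whenever the sensor
    flags the current round, until it passes or the budget runs out."""
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        answer, conf = mllm_infer(question, feedback)
        if not sensor(conf):
            return answer, round_idx
        feedback = "re-attend to key frames"  # guidance for next round
    return answer, max_rounds

answer, rounds_used = controller("What does the lecture explain?")
```

In this toy run the first round is flagged and the second, feedback-guided round passes, so two rounds are used; the real system instead derives feedback from intermediate interpretations of the frozen model and schedules compute accordingly.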