🤖 AI Summary
Existing video multimodal large language models (MLLMs) suffer from poor robustness, limited accuracy, and inflexible use of computational resources at test time, especially in lightweight variants. To address these limitations, we propose the first cybernetics-inspired adaptive inference framework: a sensor–controller–reasoning closed loop for frozen video MLLMs that enables online self-monitoring, feedback-triggered correction, staged refinement, and lightweight resource scheduling, all without retraining. The framework is model-agnostic and compatible with diverse frozen MLLMs. Empirically, it achieves substantial gains on VideoMMMU: +8.3% for Qwen2.5-VL-7B, +10.0% for Qwen2.5-VL-72B, and +5.5% for InternVL3-8B, surpassing GPT-4o. Overall performance approaches human-expert level, with consistent improvements maintained across the VideoMME and WorldSense benchmarks.
📝 Abstract
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations can be more severe for models with fewer parameters. To address them, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors the forward process of the MLLM and collects intermediate signals, such as attention drift; the controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks such as VideoMME and WorldSense, highlighting its effectiveness and generalization in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
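The sensor–controller loop described above can be sketched in miniature. This is a hypothetical illustration, not CyberV's actual implementation: `mllm_infer`, the drift proxy in `sensor`, the feedback string, and the thresholds are all stand-ins for the paper's components (the frozen MLLM forward pass, attention-drift monitoring, and controller-generated guidance).

```python
# Minimal sketch of a sensor–controller–MLLM closed loop (illustrative only).
# All functions and thresholds here are placeholders, not the CyberV API.

def mllm_infer(question, feedback=None):
    """Stand-in for a forward pass of a frozen video MLLM.

    Returns an (answer, confidence) pair; in this toy version,
    controller feedback simply raises the confidence proxy.
    """
    confidence = 0.4 if feedback is None else 0.9
    return f"answer({question})", confidence

def sensor(confidence, drift_threshold=0.5):
    """Monitor an intermediate signal (here, a crude attention-drift
    proxy via confidence) and flag unreliable rounds."""
    return confidence < drift_threshold  # True -> trigger correction

def controller(question, max_rounds=3):
    """Run inference rounds, injecting feedback whenever the sensor
    flags the current round, until it passes or the budget runs out."""
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        answer, conf = mllm_infer(question, feedback)
        if not sensor(conf):
            return answer, round_idx
        feedback = "re-attend to key frames"  # guidance for next round
    return answer, max_rounds

answer, rounds_used = controller("What does the lecture explain?")
```

In this toy run the first round is flagged and the second, feedback-guided round passes, so two rounds are used; the real system instead derives feedback from intermediate interpretations of the frozen model and schedules compute accordingly.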