Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video multimodal large language models (Video-MLLMs) suffer significant performance degradation in long-video understanding due to the context-length limits of their underlying LLMs. To address this, we propose Free-MoRef, a training-free, single-forward-pass inference framework that brings Mixture-of-Experts (MoE) principles to Video-MLLM context extension. Free-MoRef chunks the long vision token sequence into multiple short reference sequences, introduces MoRef-attention to gather clues from these chunks in parallel into unified query activations, and, after the shallow LLM layers, fuses key tokens from the chunks into a mixed reasoning sequence to restore cross-chunk interactions. This enables 2–8× longer frame inputs without token compression, requires no fine-tuning, and keeps instant responses on a single A100 GPU. On benchmarks including VideoMME, MLVU, and LongVideoBench, Free-MoRef brings significant gains and even surpasses dedicatedly trained long-video MLLMs, demonstrating substantially improved long-range temporal modeling.
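As a rough illustration of the multi-reference attention idea, the minimal PyTorch sketch below attends a query to each vision-token chunk independently and merges the per-chunk outputs with log-sum-exp weights. The function name, tensor shapes, and the merge rule are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of multi-reference attention: the text query attends to
# each vision-token chunk separately, and per-chunk outputs are combined with
# log-sum-exp weights (an assumed merge rule; the paper's aggregation may differ).
import torch

def moref_attention(q, kv_chunks, scale):
    """q: (heads, q_len, dim); kv_chunks: list of (k, v), each (heads, c_len, dim)."""
    outs, lses = [], []
    for k, v in kv_chunks:
        scores = torch.einsum("hqd,hkd->hqk", q, k) * scale  # per-chunk attention logits
        lses.append(torch.logsumexp(scores, dim=-1))         # per-chunk normalizer (heads, q_len)
        outs.append(torch.einsum("hqk,hkd->hqd", scores.softmax(-1), v))
    w = torch.softmax(torch.stack(lses), dim=0).unsqueeze(-1)  # chunk merge weights
    return (torch.stack(outs) * w).sum(dim=0)                  # unified query activation
```

With this particular merge rule, the result matches full attention over the concatenation of all chunks, while each chunk can be processed in parallel; the attention never crosses chunk boundaries, which is why a later fusion step is needed.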

πŸ“ Abstract
Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation of the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance in long-video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. In contrast, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shallow layers in the LLM, a reference fusion step composes a final mixed reasoning sequence from key tokens of the parallel chunks, compensating for the cross-reference vision interactions that MoRef-attention neglects. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance at much lower computing cost when reasoning over multiplexed context lengths, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef achieves full perception of 2× to 8× longer input frames without compression on a single A100 GPU while keeping instant responses, bringing significant performance gains and even surpassing dedicatedly trained long-video MLLMs. Code is available at https://github.com/wkfdb/Free-MoRef
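To make the reference fusion step concrete, here is a hedged sketch of one way such a step could compose the mixed reasoning sequence: each chunk keeps its most query-attended vision tokens, which are then concatenated in order. The scoring rule, the `keep_per_chunk` budget, and the function name are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of reference fusion: after the shallow layers, keep the
# vision tokens that received the most attention from the text queries in each
# chunk, then concatenate them into a single mixed reasoning sequence.
import torch

def fuse_references(chunk_tokens, chunk_attn, keep_per_chunk=64):
    """chunk_tokens: list of (c_len, dim); chunk_attn: list of (q_len, c_len)
    attention maps from text queries to each chunk's vision tokens."""
    kept = []
    for tokens, attn in zip(chunk_tokens, chunk_attn):
        score = attn.mean(dim=0)                   # assumed importance: mean attention received
        k = min(keep_per_chunk, tokens.size(0))
        idx = score.topk(k).indices.sort().values  # keep original temporal order
        kept.append(tokens[idx])
    return torch.cat(kept, dim=0)                  # mixed reasoning sequence
```

The design intuition from the abstract is that this fused sequence lets tokens from different chunks interact in the remaining LLM layers, recovering the cross-reference interactions the parallel per-chunk attention skips.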
Problem

Research questions and friction points this paper is trying to address.

Enhancing Video-MLLMs for long video understanding
Overcoming context length limitations efficiently
Multiplexing perception without training or compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multiplexing of Video-MLLM context perception
MoRef-attention gathers clues from multi-reference chunks
Splits and fuses long vision tokens for efficiency
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Kuo Wang, Sun Yat-Sen University (semi-supervised learning, object detection)
Quanlong Zheng, OPPO AI Center, OPPO Inc., China
Junlin Xie, University of Electronic Science and Technology of China (robotics, machine learning)
Yanhao Zhang, OPPO AI Center, OPPO Inc., China
Jinguo Luo, Harbin Institute of Technology, Shenzhen, China
Haonan Lu, OPPO AI Center, OPPO Inc., China
Liang Lin, Sun Yat-sen University, Professor of Computer Science, Fellow of IEEE/IAPR (embodied AI, causal inference and learning, multimodal data analysis)
Fan Zhou, Research Institute, Sun Yat-sen University, Shenzhen, China
Guanbin Li, Sun Yat-sen University, Peng Cheng Laboratory, Guangdong Key Laboratory of Big Data Analysis and Processing