ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

📅 2026-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and redundancy inherent in frame-by-frame processing for long video understanding. The authors propose an efficient compressed-domain representation that retains only sparse RGB keyframes to capture appearance information, while introducing a block motion denoising and refinement module to construct a compact, linearly scalable motion representation as an alternative to conventional optical flow. This approach seamlessly integrates with multimodal large language models and achieves state-of-the-art performance on multiple long-video benchmarks—including LongVideoBench, NExT-QA, and MLVU—demonstrating its effectiveness in significantly reducing computational complexity without compromising performance.

📝 Abstract
While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. Processing a full stream of RGB frames is computationally intractable and highly redundant, since self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that operates directly on the compressed representation of a video. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. Because block-based motion is noisy and low in fidelity, we introduce a module that denoises it and produces a fine-grained motion representation. Furthermore, our model compresses these features so that cost scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments on a comprehensive suite of long-video understanding benchmarks, where it outperforms baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
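To make the scaling claim concrete, the sketch below counts the token budget of a keyframe-plus-block-motion representation against dense RGB decoding. All names, the keyframe interval, and the block size here are illustrative assumptions, not ReMoRa's actual configuration; the point is only that the budget grows linearly with the number of frames.

```python
import numpy as np

def compressed_video_tokens(num_frames, height, width,
                            keyframe_interval=32, block=16):
    """Illustrative token budget for a sparse-keyframe + block-motion
    representation. Hypothetical parameters, not the paper's settings.
    """
    # Sparse RGB keyframes carry appearance information.
    num_keyframes = int(np.ceil(num_frames / keyframe_interval))
    # Every other frame is represented only by coarse block motion:
    # one 2-D displacement vector per (block x block) patch.
    blocks_per_frame = (height // block) * (width // block)
    motion_frames = num_frames - num_keyframes

    rgb_tokens = num_keyframes * height * width       # appearance
    motion_tokens = motion_frames * blocks_per_frame  # temporal dynamics
    return rgb_tokens + motion_tokens

# Doubling the frame count doubles the budget: growth is linear in
# sequence length, unlike the quadratic self-attention cost incurred
# when attending over dense per-frame RGB tokens.
t1 = compressed_video_tokens(256, 224, 224)
t2 = compressed_video_tokens(512, 224, 224)
```

With a keyframe every 32 frames and 16×16 motion blocks, the motion stream costs roughly 200 values per frame instead of 224×224×3 pixels, which is the redundancy the compressed-domain design exploits.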
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
multimodal large language models
temporal dynamics
computational intractability
video redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion representation
video compression
multimodal large language model
long-video understanding
linear complexity