🤖 AI Summary
This work addresses the high computational cost and redundancy inherent in frame-by-frame processing for long video understanding. The authors propose an efficient compressed-domain representation that retains only sparse RGB keyframes to capture appearance information, while introducing a block motion denoising and refinement module to construct a compact, linearly scalable motion representation as an alternative to conventional optical flow. This approach seamlessly integrates with multimodal large language models and achieves state-of-the-art performance on multiple long-video benchmarks—including LongVideoBench, NExT-QA, and MLVU—demonstrating its effectiveness in significantly reducing computational complexity without compromising performance.
📝 Abstract
While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. Processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that operates directly on compressed video representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for dense sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion, we introduce a module that denoises it and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments on a comprehensive suite of long-video understanding benchmarks, where it outperforms baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
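The core idea — keep only sparse RGB keyframes for appearance and summarize the frames in between with coarse block motions — can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the function names (`sample_keyframes`, `motion_tokens`) and parameters (`keyframe_stride`, `block_size`) are assumptions chosen for clarity.

```python
# Illustrative sketch (not ReMoRa's actual pipeline): sparse keyframe
# sampling plus block-averaged motion as a compact proxy for optical flow.

def sample_keyframes(num_frames, keyframe_stride=8):
    """Indices of the sparse RGB keyframes kept for appearance.
    The number of keyframes grows linearly with video length."""
    return list(range(0, num_frames, keyframe_stride))

def motion_tokens(motion_field, block_size=16):
    """Average a dense (H, W) field of (dx, dy) motion vectors into
    coarse per-block motions, one token per block.

    Real compressed video already stores motion at block granularity;
    here we simulate that by pooling a dense field."""
    h = len(motion_field)
    w = len(motion_field[0])
    tokens = []
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            sx = sy = n = 0.0
            for y in range(by, min(by + block_size, h)):
                for x in range(bx, min(bx + block_size, w)):
                    dx, dy = motion_field[y][x]
                    sx += dx
                    sy += dy
                    n += 1
            tokens.append((sx / n, sy / n))
    return tokens
```

Because each non-keyframe contributes only a fixed number of block-motion tokens and keyframes are sampled at a fixed stride, the total token count grows linearly with the number of frames — the property the abstract highlights in contrast to dense RGB decoding.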