ViSS-R1: Self-Supervised Reinforcement Video Reasoning

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) excessively rely on textual cues for complex video reasoning, neglecting dynamic visual information—leading to shortcut learning and hallucination. To address this, we propose ViSS-R1, a framework that integrates self-supervised pretraining objectives (e.g., visual transformation modeling) into the R1 post-training paradigm. Central to our approach is Pretext-GRPO, a vision-centric self-supervised reinforcement learning algorithm that jointly optimizes pretext tasks and user queries. This enables explicit modeling of temporal structure, motion patterns, and fine-grained visual semantics. Evaluated on six mainstream video understanding benchmarks, ViSS-R1 consistently outperforms state-of-the-art methods. Ablations confirm its effectiveness in mitigating hallucination, suppressing text-based shortcuts, and enhancing generalization across diverse video reasoning tasks.

📝 Abstract
Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster more robust, visual-centric video understanding, we first introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, compelling the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addresses underutilization of visual information in video reasoning
Prevents shortcut learning and hallucination in MLLM video tasks
Enhances visual-centric understanding through self-supervised reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised reinforcement learning with Pretext-GRPO algorithm
ViSS-R1 integrates pretext tasks into MLLM R1 post-training
Models reason by identifying transformations and reconstructing videos
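The reward design described above can be sketched in a few lines: each sampled rollout in a GRPO group earns a reward for answering the user query plus a reward for solving the pretext task (identifying the applied visual transformation), and rewards are normalized within the group to form advantages. This is a minimal illustration assuming binary outcome rewards and a simple weighted sum; the function name, the `w_pretext` weight, and the exact reward combination are assumptions for illustration, not the paper's released implementation.

```python
import numpy as np

def pretext_grpo_advantages(answer_correct, pretext_correct, w_pretext=0.5):
    """Group-relative advantages for one GRPO rollout group.

    answer_correct / pretext_correct: per-rollout 0/1 outcomes for the
    user query and the pretext (transformation-identification) task.
    w_pretext is an illustrative weight, not a value from the paper.
    """
    r = np.asarray(answer_correct, float) + w_pretext * np.asarray(pretext_correct, float)
    # GRPO-style normalization: advantage = (reward - group mean) / group std.
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled rollouts for one (video, pretext question, user query) triple.
adv = pretext_grpo_advantages(answer_correct=[1, 0, 1, 0],
                              pretext_correct=[1, 1, 0, 0])
```

A rollout that solves both the user query and the pretext task receives the highest advantage, so the policy is pushed toward responses grounded in the actual visual content rather than text-only shortcuts.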