SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Date: 2026-04-22
AI Summary
Reinforcement learning for multimodal large language models often over-relies on linguistic priors and human-annotated rewards, which limits both visual understanding and scalability. This work proposes a general self-supervised reinforcement learning framework that reformulates self-supervised visual tasks as "visual puzzles" requiring no human or external-model supervision. These puzzles yield directly verifiable, purely visual reward signals for post-training multimodal large language models. The approach substantially improves performance on multimodal understanding and reasoning benchmarks, demonstrating the potential of vision-centric, self-supervised rewards for reinforcement learning at scale.

Abstract
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown great potential for enhancing the reasoning abilities of multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations hinders MLLMs' intrinsic visual understanding and the design of scalable rewards. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We believe this work provides useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
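
To make the idea of a "verifiable visual puzzle" concrete, here is a minimal sketch in Python. It uses rotation prediction, a classic SSL pretext task, purely as an illustration: the abstract does not say which SSL tasks SSL-R1 reformulates, so the task choice, the function names (`rotate_ccw`, `make_rotation_puzzle`, `rotation_reward`), and the `<answer>` tag format are all assumptions rather than the authors' implementation. The point it shows is that the label comes from the image transformation itself, so the reward can be checked without human or external-model supervision.

```python
import random
import re

# Candidate answers in degrees, counter-clockwise (hypothetical puzzle design).
ROTATIONS = (0, 90, 180, 270)


def rotate_ccw(grid, quarter_turns):
    """Rotate a 2D grid of pixel values counter-clockwise by 90-degree steps."""
    for _ in range(quarter_turns % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def make_rotation_puzzle(image, rng):
    """Build a puzzle from an unlabeled image.

    Returns the rotated image and the rotation angle. The angle is known
    from the transformation, so no annotation is required.
    """
    k = rng.randrange(len(ROTATIONS))
    return rotate_ccw(image, k), ROTATIONS[k]


def rotation_reward(response, label):
    """Binary verifiable reward: 1.0 if the parsed answer equals the label."""
    # Assumed output format: the model wraps its final answer in <answer> tags.
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if int(match.group(1)) == label else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1, 2, 3], [4, 5, 6]]  # toy 2x3 "image"
    puzzle, label = make_rotation_puzzle(image, rng)
    print(rotation_reward(f"<answer>{label}</answer>", label))
```

In an RL post-training loop, a reward of this kind would score each sampled response to the rotated image. Other pretext tasks could be wrapped the same way, as long as the correct answer is determined by the transformation applied to the image.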

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
reinforcement learning
self-supervised learning
visual understanding
verifiable rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
reinforcement learning
multimodal large language models
visual reasoning
verifiable rewards