Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses key limitations of multimodal large language models (MLLMs) in vision-based reinforcement learning (RL)—specifically weak generalization, reasoning biases, and training inefficiency—when tackling perception-intensive tasks like jigsaw puzzles. The authors propose a rule-guided visual RL framework, featuring a structured puzzle environment, a multi-stage training strategy (combining supervised fine-tuning [SFT] and rule-augmented RL), and cross-task transfer evaluation. Key findings: (1) MLLMs implicitly encode sophisticated spatial reasoning patterns; (2) RL achieves significantly stronger generalization than SFT; (3) an SFT-based cold-start initialization hinders subsequent RL optimization—an unintuitive training dynamic. Experiments demonstrate accuracy improving from random chance to near-perfect performance, with robust generalization to unseen puzzle configurations and other vision tasks. The results validate the efficacy and underlying principles of rule-structured visual RL for MLLMs.
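The "structured puzzle environment" mentioned above can be illustrated with a minimal sketch: cut a grid into tiles, shuffle them under a recorded permutation, and keep that permutation as the ground-truth label the model must recover. The function name, tile layout, and seeding below are assumptions for illustration, not the paper's actual code.

```python
import random

def make_jigsaw(grid, rows, cols, seed=0):
    """Split a 2D grid (list of rows) into rows x cols tiles and shuffle them.

    Returns (shuffled_tiles, perm), where perm[i] is the ORIGINAL index of
    the tile now sitting at position i — the label an MLLM must predict.
    """
    h, w = len(grid), len(grid[0])
    th, tw = h // rows, w // cols  # tile height/width (assumes even division)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            # Slice out one tile: th rows, each cropped to tw columns.
            tile = [row[c * tw:(c + 1) * tw] for row in grid[r * th:(r + 1) * th]]
            tiles.append(tile)
    perm = list(range(rows * cols))
    random.Random(seed).shuffle(perm)  # deterministic shuffle for reproducibility
    return [tiles[i] for i in perm], perm
```

Running the same recipe on real images (e.g. with PIL crops instead of list slices) yields the perception-heavy variant studied in the paper.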

📝 Abstract
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. First, we find that MLLMs, initially performing close to random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Third, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourth, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
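The "rule-based" part of the framework refers to rewards computed by fixed rules rather than a learned reward model. A minimal sketch of such a reward for the jigsaw task is below; the `<answer>...</answer>` tag format, the bonus weight, and the function name `jigsaw_reward` are assumptions for illustration, not the paper's actual implementation.

```python
import re

def jigsaw_reward(response, true_perm):
    """Rule-based reward sketch: a small bonus when the model wraps its
    answer in <answer>...</answer> tags, plus an accuracy term equal to
    the fraction of correctly placed pieces."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0                      # no parsable answer: zero reward
    fmt = 0.1                           # format bonus (weight is an assumption)
    pred = [int(x) for x in re.findall(r"\d+", m.group(1))]
    if len(pred) != len(true_perm):
        return fmt                      # malformed permutation: format bonus only
    acc = sum(p == t for p, t in zip(pred, true_perm)) / len(true_perm)
    return fmt + acc
```

Because the reward is a deterministic function of the response string and the ground-truth permutation, it can be plugged directly into a policy-gradient loop (e.g. GRPO-style training) without any reward-model training.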
Problem

Research questions and friction points this paper is trying to address.

Studying rule-based visual RL challenges in MLLMs for perception-heavy tasks.
Exploring jigsaw puzzles as a framework for visual RL generalization.
Comparing RL and SFT effectiveness in multimodal learning scenarios.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned MLLMs achieve near-perfect puzzle accuracy
Jigsaw training generalizes to other visual tasks
RL outperforms SFT in generalization effectiveness