Puzzle Curriculum GRPO for Vision-Centric Reasoning

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current reinforcement learning (RL)-based vision language models (VLMs) suffer from three key limitations in chain-of-thought (CoT) reasoning: reliance on costly human annotations or external verifiers, sparse and coarse-grained rewards, and logical inconsistency between generated reasoning chains and final answers. Method: We propose Puzzle Curriculum GRPO (PC-GRPO), an unsupervised RL framework that eliminates the need for annotations or external validators. It introduces self-supervised puzzle tasks (PatchFit, Rotation, and Jigsaw) that yield verifiable, fine-grained rewards; incorporates difficulty-aware curriculum learning to mitigate reward sparsity and advantage decay; and enforces reasoning–answer consistency (RAC) via a dedicated monitoring and enhancement mechanism. Results: Evaluated on Qwen-3B and Qwen-7B backbones, the method significantly improves CoT reasoning quality across multiple benchmarks, enhances training stability, and boosts downstream accuracy. RAC scores are higher and decay more slowly, establishing a scalable, verifiable, and interpretable paradigm for RL-based post-training of VLMs.
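The reward design above contrasts all-or-nothing rewards (PatchFit, Rotation) with graded partial credit for Jigsaw. A minimal sketch of that distinction, assuming a Jigsaw reward proportional to the number of correctly placed patches (function and argument names here are illustrative, not from the paper):

```python
def binary_reward(prediction, target):
    """All-or-nothing reward, as used for PatchFit and Rotation tasks."""
    return 1.0 if prediction == target else 0.0

def jigsaw_reward(predicted_order, true_order):
    """Graded partial credit: fraction of patches placed in the correct
    position, giving a dense signal even when the full permutation is wrong."""
    assert len(predicted_order) == len(true_order)
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)
```

A half-correct Jigsaw arrangement thus still earns 0.5 reward, which is what mitigates the sparsity that a purely binary scheme would exhibit on hard puzzles.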

📝 Abstract
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
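The curriculum described in the abstract weights samples by difficulty and peaks at medium difficulty, since groups where every rollout succeeds or every rollout fails produce vanishing group-relative advantages in GRPO. One possible weighting function, assuming a Gaussian bump over the group success rate (the exact functional form and width are assumptions; the paper only states that the weighting peaks at medium difficulty):

```python
import math

def curriculum_weight(success_rate, width=0.25):
    """Hypothetical difficulty-aware weight: largest when roughly half the
    sampled rollouts solve the puzzle (success_rate ~ 0.5), decaying toward
    0.0 and 1.0 where group-relative advantages collapse to zero."""
    return math.exp(-((success_rate - 0.5) ** 2) / (2 * width ** 2))
```

Under this sketch, an all-correct or all-wrong group is down-weighted symmetrically, concentrating gradient signal on medium-difficulty puzzles.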
Problem

Research questions and friction points this paper is trying to address.

Eliminates reliance on costly hand-curated annotations or external verifiers
Addresses flat and sparse reward schemes in reinforcement learning for VLMs
Mitigates logical inconsistency between reasoning chains and final answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised puzzle environments replace costly annotations
Difficulty-aware curriculum dynamically weights training samples
Consistency-enforcing rewards improve reasoning-answer alignment
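The consistency-enforcing idea in the last bullet can be sketched as a check that the answer claimed inside the reasoning chain matches the final answer, with reward granted only on agreement. The tag format and regular expressions below are assumptions, not the paper's actual implementation:

```python
import re

def rac_consistent(completion):
    """True if the last answer claimed inside <think>...</think> matches the
    content of <answer>...</answer>. Tag names and the 'answer is X' pattern
    are illustrative assumptions about the completion format."""
    think = re.search(r"<think>(.*?)</think>", completion, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if not (think and answer):
        return False
    claims = re.findall(r"answer is\s+([A-Za-z0-9]+)", think.group(1))
    return bool(claims) and claims[-1].strip() == answer.group(1).strip()
```

A consistency-enforcing reward would then multiply or gate the task reward by this check, penalizing chains whose reasoning and final answer diverge.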