Puzzle Curriculum GRPO for Vision-Centric Reasoning

📅 2025-12-16

🤖 AI Summary

Problem: Current reinforcement learning (RL)-based vision-language models (VLMs) have three key limitations in chain-of-thought (CoT) reasoning: reliance on costly human annotations or external verifiers, flat and sparse rewards, and logical inconsistency between generated reasoning chains and final answers. Method: The paper proposes Puzzle Curriculum GRPO (PC-GRPO), a supervision-free RL recipe that needs neither annotations nor external verifiers. It uses three self-supervised puzzle tasks (PatchFit, Rotation, Jigsaw) to produce verifiable rewards, with graded partial credit on Jigsaw; applies a difficulty-aware curriculum to counter flat rewards and vanishing group-relative advantages; and monitors Reasoning-Answer Consistency (RAC) during post-training, adding consistency-enforcing rewards to raise it. Results: On Qwen-3B and Qwen-7B backbones, the method improves CoT reasoning quality, training stability, and downstream accuracy across multiple benchmarks. The curriculum delays the decline in RAC seen with vanilla GRPO, and RAC correlates with downstream accuracy.

📝 Abstract

Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
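The link between binary rewards and vanishing group-relative advantages can be illustrated with a minimal Python sketch. It is not the paper's implementation: the scoring rule in `jigsaw_partial_credit` (fraction of tiles in the correct position), the function names, and the epsilon are assumptions made for illustration, and the advantage is the commonly used GRPO form that standardizes rewards within one group of rollouts.

```python
from statistics import mean, pstdev


def binary_reward(predicted, target):
    """Return 1.0 only for an exact match, otherwise 0.0."""
    return 1.0 if predicted == target else 0.0


def jigsaw_partial_credit(predicted_perm, true_perm):
    """Assumed scoring rule: fraction of tiles placed in the correct slot."""
    if len(predicted_perm) != len(true_perm):
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_perm, true_perm))
    return correct / len(true_perm)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one 2x2 jigsaw; none is fully correct.
true_perm = [0, 1, 2, 3]
rollouts = [[0, 1, 3, 2], [1, 0, 3, 2], [0, 2, 1, 3], [3, 2, 1, 0]]

binary = [binary_reward(r, true_perm) for r in rollouts]
graded = [jigsaw_partial_credit(r, true_perm) for r in rollouts]

print("binary rewards:", binary, "advantages:", group_relative_advantages(binary))
print("graded rewards:", graded, "advantages:", group_relative_advantages(graded))
```

In this toy group every rollout receives the same binary reward, so all advantages are zero and the group carries no learning signal. Partial credit separates the near misses from the complete failures, which yields nonzero advantages for the same rollouts.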

Problem

Research questions and friction points this paper is trying to address.

Reliance on costly, noisy hand-curated annotations or external verifiers

Flat and sparse reward schemes in GRPO-based reinforcement learning for VLMs

Logical inconsistency between reasoning chains and final answers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised puzzle environments replace costly annotations

Difficulty-aware curriculum dynamically weights training samples

Consistency-enforcing rewards improve reasoning-answer alignment
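
The difficulty-aware curriculum listed above weights samples so that the weight peaks at medium difficulty. The summary does not give the weighting function, so the Python sketch below is only one possible shape under stated assumptions: difficulty is estimated as one minus the mean reward of a rollout group, and the weight is the parabola `4 * d * (1 - d)`. Both choices and all function names are hypothetical.

```python
def estimate_difficulty(group_rewards):
    """Assumed proxy: difficulty = 1 - mean reward of the rollout group."""
    if not group_rewards:
        raise ValueError("group_rewards must not be empty")
    success = sum(group_rewards) / len(group_rewards)
    return 1.0 - min(max(success, 0.0), 1.0)


def curriculum_weight(difficulty):
    """Hypothetical weight that is 0 at the extremes and 1 at difficulty 0.5."""
    d = min(max(difficulty, 0.0), 1.0)
    return 4.0 * d * (1.0 - d)


def weight_advantages(advantages, group_rewards):
    """Scale a group's advantages by its curriculum weight."""
    w = curriculum_weight(estimate_difficulty(group_rewards))
    return [w * a for a in advantages]


# Too-easy, medium, and too-hard groups (rewards in [0, 1]).
for rewards in ([1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]):
    d = estimate_difficulty(rewards)
    print(f"difficulty={d:.2f} weight={curriculum_weight(d):.2f}")
```

Under these assumptions, puzzles the model always solves or never solves contribute nothing, while puzzles solved about half the time contribute fully. Any other unimodal function peaking near the middle would play the same role.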
