🤖 AI Summary
Existing reinforcement learning approaches struggle to efficiently train the self-correction capabilities of vision-language models (VLMs) due to the sparsity of effective self-correction samples. To address this challenge, this work proposes the Octopus framework, which enhances sample efficiency by recombining existing rollouts to generate dense correction samples and introduces a response masking strategy to decouple the correction process from direct reasoning. This approach enables, for the first time, efficient and controllable training of VLM self-correction. The resulting Octopus-8B model achieves state-of-the-art performance among open-source VLMs across seven benchmarks, outperforming the best RLVR baseline by 1.0 point while requiring only 0.72× the training time per step.
📝 Abstract
Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across seven benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 point while requiring only $0.72\times$ training time per step.
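The abstract does not spell out the implementation, but the two core ideas (recombining rollouts into dense correction samples, and masking the loss so that only the correction segment is trained) can be loosely illustrated. The sketch below is an assumption-laden illustration, not the authors' code: the function name `synthesize_correction_samples`, the correction-cue text, and the flat token lists are all hypothetical stand-ins.

```python
import itertools

def synthesize_correction_samples(rollouts, cue="Wait, let me re-check my reasoning."):
    """Hypothetical sketch of correction-specific rollouts.

    For one prompt, pair each incorrect rollout (used as a synthetic
    "first attempt") with a correct rollout (used as the "correction"),
    joined by a correction cue. A per-token loss mask is 0 over the
    injected first attempt and cue, and 1 over the correction tokens,
    so the RL gradient updates only the self-correction behavior.
    """
    wrong = [r for r in rollouts if not r["correct"]]
    right = [r for r in rollouts if r["correct"]]
    samples = []
    for w, c in itertools.product(wrong, right):
        first = w["tokens"] + [cue]          # masked-out prefix
        correction = c["tokens"]             # supervised correction segment
        samples.append({
            "tokens": first + correction,
            "loss_mask": [0.0] * len(first) + [1.0] * len(correction),
        })
    return samples

# Toy usage: 2 incorrect rollouts x 1 correct rollout -> 2 dense samples,
# instead of waiting for rare spontaneous self-corrections.
rollouts = [
    {"tokens": ["Answer:", "12"], "correct": False},
    {"tokens": ["Answer:", "15"], "correct": True},
    {"tokens": ["Answer:", "10"], "correct": False},
]
samples = synthesize_correction_samples(rollouts)
print(len(samples))  # → 2
```

Because the synthesized samples reuse already-generated rollouts, no extra model inference is needed, which is consistent with the reported sample-efficiency and per-step training-time gains.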