Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

πŸ“… 2025-03-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address two key bottlenecks of GRPO in multimodal large language models (MLLMs), low data utilization (insufficient positive rewards on hard samples) and text bias (over-reliance on textual cues at the expense of visual information), this paper proposes Hint-GRPO. First, it introduces a difficulty-adaptive hint mechanism that provides reasoning hints so that positive rewards can be acquired on hard samples. Second, it designs a test-time text-bias calibration that mitigates text bias by calibrating the token prediction logits with the image condition. To the authors' knowledge, Hint-GRPO is the first framework within the GRPO paradigm to systematically tackle both image neglect and the lack of policy updates on hard samples in MLLMs. Experiments across three base MLLMs and eleven benchmarks demonstrate significant improvements in mathematical and complex multimodal reasoning accuracy. The code is publicly available.

πŸ“ Abstract
MLLM reasoning has drawn widespread research interest for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM, demonstrating strong generalization performance with an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization means that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLM by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at https://github.com/hqhQAQ/Hint-GRPO.
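The test-time text-bias calibration described above can be pictured as a contrastive adjustment of next-token logits. The sketch below is a minimal illustration, assuming a contrastive-decoding-style rule; the function name, the `alpha` weight, and the exact formula are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def calibrate_logits(logits_with_image, logits_text_only, alpha=1.0):
    """Shift next-token logits toward the image-conditioned prediction.

    Hypothetical form: treat the text-only logits as an estimate of the
    text bias and subtract it out, scaled by alpha. The paper's actual
    calibration rule may differ.
    """
    return logits_with_image + alpha * (logits_with_image - logits_text_only)

# Toy vocabulary of two tokens: conditioned on the image, token 0 is
# preferred; from text alone, token 1 is preferred (the text-bias case).
with_image = np.array([2.0, 1.0])
text_only = np.array([1.0, 2.0])
print(calibrate_logits(with_image, text_only))  # [3. 0.] -> token 0 wins
```

After calibration, the token favored by the image condition dominates, counteracting the tendency to generate from the text alone.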
Problem

Research questions and friction points this paper is trying to address.

Improving GRPO's data utilization for difficult MLLM samples
Addressing text-bias in MLLM post-GRPO training
Enhancing multimodal reasoning in challenging tasks (e.g., math)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hint-GRPO adaptively provides hints for samples
Text-bias calibration mitigates bias with image
Improves MLLM reasoning via adaptive hint and calibration
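The adaptive-hint idea can be sketched as providing a prefix of the ground-truth reasoning chain whose length grows with sample difficulty. The schedule below (difficulty measured by rollout accuracy, linear scaling, `max_ratio`) is a hypothetical illustration, not the paper's exact mechanism:

```python
def adaptive_hint(reasoning_steps, rollout_accuracy, max_ratio=0.8):
    """Return a hint prefix of the ground-truth reasoning steps.

    Hypothetical schedule: the harder the sample (lower rollout accuracy
    under the current policy), the longer the hint, so rollouts can reach
    positive rewards and GRPO obtains a usable learning signal.
    """
    ratio = max_ratio * (1.0 - rollout_accuracy)
    n_steps = int(round(ratio * len(reasoning_steps)))
    return reasoning_steps[:n_steps]

steps = ["parse the figure", "set up the equation", "solve for x",
         "check units", "state the answer"]
print(adaptive_hint(steps, rollout_accuracy=0.0))  # hardest: 4 of 5 steps
print(adaptive_hint(steps, rollout_accuracy=1.0))  # already solved: no hint
```

Easy samples receive no hint and are trained as in plain GRPO; only samples the policy cannot yet solve get guidance.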
Qihan Huang
PhD Student, Zhejiang University
Long Chan
Alibaba Group
Jinlong Liu
Alibaba Group
Wanggui He
Researcher, Alibaba Group
Hao Jiang
Alibaba Group
Mingli Song
Zhejiang University
Jingyuan Chen
Zhejiang University
Chang Yao
Zhejiang University
Jie Song
Zhejiang University