CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak fine-grained regression capability and poor robustness to large-count scenarios of Vision-Language Models (VLMs) in crowd counting, this paper proposes the Fuzzy Group Relative Policy Reward (FGRPR) framework. Methodologically, it introduces a fuzzy logic–based reward mechanism into the RL alignment of VLMs—marking the first such integration—to overcome the expressive limitations of binary rewards and enable gradient incentives for approximately correct predictions. It further combines Group Relative Policy Optimization (GRPO) with multi-stage vision-language instruction tuning. Evaluated on Qwen2.5-VL (3B/7B), FGRPR achieves state-of-the-art performance across five mainstream crowd counting benchmarks, outperforming GPT-4o, LLaMA2-90B, and supervised fine-tuning (SFT) baselines. In out-of-domain evaluation, it reduces mean absolute error (MAE) by 12.6% over SFT, with up to 37% error reduction in large-count scenes.

📝 Abstract
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT-4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
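The core idea contrasted in the abstract, a graded fuzzy reward versus the conventional binary 0/1 accuracy reward, can be sketched as follows. The paper's exact reward formula is not given in this summary, so the relative-error decay below is an illustrative assumption that merely matches the described behavior (closer count predictions earn higher rewards):

```python
def binary_reward(pred: int, target: int) -> float:
    """Conventional 0/1 accuracy reward: only an exact count scores."""
    return 1.0 if pred == target else 0.0

def fuzzy_reward(pred: int, target: int) -> float:
    """Illustrative fuzzy reward (assumed form, not the paper's exact
    formula): decays linearly with relative error, so near-misses still
    receive a graded incentive instead of zero."""
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - target) / target
    return max(0.0, 1.0 - rel_err)

# A near-miss on a crowd of 100 people:
# binary_reward(95, 100) -> 0.0  (no learning signal)
# fuzzy_reward(95, 100)  -> 0.95 (gradient incentive toward the target)
```

Under GRPO, rewards like these are normalized within each sampled group of responses, so a graded signal of this kind differentiates "close" from "far off" completions even when none is exactly correct.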
Problem

Research questions and friction points this paper is trying to address.

Enhancing crowd counting precision with fuzzy rewards
Improving learning efficiency in vision-language models
Outperforming baseline models across diverse datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuzzy Group Relative Policy Reward framework
Nuanced fuzzy reward for precise outputs
Outperforms baseline models in accuracy
👥 Authors
Zhiqiang Wang, Florida Atlantic University
Pengbin Feng, Xidian University
Yanbin Lin, Florida Atlantic University
Shuzhang Cai, University of Texas at Dallas
Zongao Bian, Georgia Institute of Technology
Jinghua Yan, University of Utah
Xingquan Zhu, Florida Atlantic University