CLS-RL: Image Classification with Rule-Based Reinforcement Learning

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the overfitting and poor generalization of multimodal large language models (MLLMs) in few-shot image classification fine-tuning. To this end, we propose CLS-RL, a rule-based reinforcement learning framework. Our key contributions are: (1) the first verifiable, rule-guided RL fine-tuning paradigm for visual classification, in which rule consistency serves as the reward signal, sidestepping the overfitting inherent in supervised fine-tuning (SFT); (2) the discovery of a "free lunch" phenomenon, where training on a single dataset improves zero-shot performance on datasets with different distributions and class names; and (3) No-Thinking-CLS-RL, which removes the explicit thinking process during training to accelerate convergence and improve generalization. Experiments demonstrate that our method achieves significantly higher average accuracy than SFT under both few-shot and base-to-new settings, while also attaining faster training, superior in-domain accuracy, and stronger cross-domain generalization.

📝 Abstract
Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We find that SFT can cause severe overfitting and may even degrade performance relative to the zero-shot approach. To address this challenge, inspired by recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs. We find that CLS-RL outperforms SFT on most datasets and achieves much higher average accuracy in both the base-to-new and few-shot learning settings. Moreover, we observe a free-lunch phenomenon for CLS-RL: when models are fine-tuned on a particular dataset, their performance on other, distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent work on inference-time thinking, we re-examine the 'thinking process' during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning and propose that it may actually detract from performance. Based on this premise, we introduce No-Thinking-CLS-RL, which minimizes the thinking process during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, No-Thinking-CLS-RL achieves superior in-domain performance and generalization compared to CLS-RL.
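The "verifiable signals as rewards" idea can be illustrated with a short sketch. This is not the paper's implementation: the `<think>`/`<answer>` tag template and the reward weights below are assumptions borrowed from common rule-based RL setups, where a format reward checks template compliance and an accuracy reward checks the answer against the ground-truth label.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows a <think>...</think><answer>...</answer>
    template (a common rule-based RL convention; the exact tags CLS-RL
    uses are an assumption here), else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, label: str) -> float:
    """1.0 if the class name inside <answer> matches the ground-truth label.
    This is the verifiable signal: no learned reward model is needed."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == label.strip().lower() else 0.0

def total_reward(completion: str, label: str,
                 w_fmt: float = 0.5, w_acc: float = 1.0) -> float:
    # Weighted sum; the weights are illustrative, not taken from the paper.
    return w_fmt * format_reward(completion) + w_acc * accuracy_reward(completion, label)
```

In a GRPO-style training loop, a score like this would be computed per sampled completion and used to compute group-relative advantages, rather than backpropagating through a reward model.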
Problem

Research questions and friction points this paper is trying to address.

Addresses overfitting in few-shot MLLM classification fine-tuning.
Proposes CLS-RL using rule-based reinforcement learning for better accuracy.
Introduces No-Thinking-CLS-RL to minimize thinking processes during training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLS-RL uses rule-based reinforcement learning for fine-tuning.
No-Thinking-CLS-RL minimizes thinking processes during training.
Verifiable signals as rewards enhance MLLM classification accuracy.
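The equality accuracy reward behind No-Thinking-CLS-RL can be sketched as follows; the exact matching rule the paper uses is an assumption here. The model receives reward only when its output is exactly the class label, so any extra reasoning text drives the reward to zero and the model learns to answer directly.

```python
def no_thinking_reward(completion: str, label: str) -> float:
    """Equality accuracy reward (illustrative sketch): full reward only when
    the output equals the class label, which penalizes any surrounding
    'thinking' text and encourages direct answers."""
    return 1.0 if completion.strip().lower() == label.strip().lower() else 0.0
```

Compared with the template-plus-accuracy reward of CLS-RL, this removes the incentive to produce a reasoning trace at all, which is consistent with the reported faster convergence.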