Unified Reinforcement and Imitation Learning for Vision-Language Models

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the deployment challenges of large-scale vision-language models (VLMs) in resource-constrained settings, this paper proposes an efficient training framework for lightweight VLMs. Methodologically, it unifies reinforcement learning and adversarial imitation learning by introducing a differentiable, LLM-based discriminator in place of handcrafted reward signals, and adds a multi-teacher collaborative distillation mechanism that fuses knowledge from multiple strong VLMs to strengthen the student model's cross-modal generation capability. Crucially, the approach improves the text generation quality and generalization of small student models without increasing inference overhead. Experiments show that the proposed model matches, and in some cases surpasses, comparable open-source and proprietary models on major vision-language benchmarks (NoCaps, FOIL, VizWiz), with average improvements of 5.2% on key metrics including BLEU-4 and CIDEr. This work establishes a novel paradigm for efficient VLM development.
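The discriminator-as-reward idea described above can be illustrated with a toy REINFORCE loop: the student's reward for a sampled caption is the log-score a frozen discriminator assigns to that caption being teacher-like. The candidate captions, discriminator scores, reward shaping, and learning rate below are all illustrative assumptions, not the paper's actual implementation.

```python
import math

# Toy candidate captions and a frozen, hypothetical discriminator score
# D(x): probability the output is teacher-like (assumed values).
CAPTIONS = ["a dog on grass", "dog grass photo", "a brown dog runs on grass"]
D_SCORES = [0.30, 0.10, 0.85]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ril_step(logits, lr=0.5):
    """One exact REINFORCE step. The adversarial imitation reward is the
    discriminator log-score, r(x) = log D(x), replacing a handcrafted
    reward; the expected reward serves as a variance-reducing baseline."""
    probs = softmax(logits)
    rewards = [math.log(d) for d in D_SCORES]
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # Gradient of expected reward w.r.t. logit i under a softmax policy:
    # p_i * (r_i - baseline); take one gradient-ascent step.
    grads = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [l + lr * g for l, g in zip(logits, grads)]

logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = ril_step(logits)

# After training, the student's probability mass concentrates on the
# caption the discriminator rates most teacher-like.
probs = softmax(logits)
```

In the paper's setting the policy is a full student VLM and the discriminator is itself LLM-based and trained adversarially; this sketch only shows the reward-signal plumbing.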

📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
Problem

Research questions and friction points this paper is trying to address.

Developing efficient training for lightweight vision-language models
Combining reinforcement and imitation learning for model enhancement
Reducing performance gap between small and large VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines reinforcement learning with imitation learning
Uses LLM-based discriminator to distinguish model outputs
Leverages multiple teacher models for diverse guidance
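The multi-teacher guidance above can be sketched as a distillation target built by fusing the teachers' output distributions; the weighted-average fusion rule and KL objective here are assumptions for illustration, as the paper's exact fusion mechanism may differ.

```python
import math

def fuse_teachers(teacher_dists, weights=None):
    """Hypothetical multi-teacher fusion: a weighted average of the
    teachers' next-token distributions forms the distillation target."""
    n = len(teacher_dists)
    weights = weights or [1.0 / n] * n
    vocab = len(teacher_dists[0])
    return [sum(w * d[i] for w, d in zip(weights, teacher_dists))
            for i in range(vocab)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): a standard distillation loss between the fused
    teacher target p and the student's distribution q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Two toy teachers over a 3-token vocabulary; the fused target is their
# element-wise average, which the student would be trained to match.
fused = fuse_teachers([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
```

Averaging keeps the target a valid distribution while letting the student inherit complementary preferences from each teacher, which is one simple way to realize "diverse guidance."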