DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing few-shot learning methods in achieving progressive and adaptive alignment between local visual attributes and global semantic concepts within vision–language fusion, which constrains semantic gains. To overcome this, the authors propose a dual-level vision–language alignment framework that leverages large language models to construct fine-grained attribute descriptions and holistic class-level representations. A reinforcement learning–driven gated attention mechanism, optimized via the REINFORCE algorithm, dynamically aligns cross-modal information by emphasizing local features in shallow network layers and prioritizing global semantics in deeper layers. The proposed method achieves state-of-the-art performance across nine benchmark datasets under three distinct few-shot evaluation settings.

📝 Abstract
Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
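To make the RL-gated fusion idea concrete, here is a minimal NumPy sketch of a single layer that blends self-attention (visual tokens only) with cross-attention (visual queries over textual attribute embeddings) using a scalar gate sampled from a lightweight policy, updated with a REINFORCE-style step. This is an illustrative sketch under assumed shapes and a placeholder reward, not the authors' implementation; all function names and hyperparameters here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def rl_gated_fusion(visual, text, gate):
    """Blend self-attention (visual-only) with cross-attention
    (visual queries attending to textual tokens) via a scalar gate."""
    self_out = attention(visual, visual, visual)
    cross_out = attention(visual, text, text)
    return gate * self_out + (1.0 - gate) * cross_out

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))   # 4 visual tokens, dim 8 (assumed)
text = rng.normal(size=(6, 8))     # 6 textual attribute embeddings (assumed)

# Lightweight policy: one logit per layer; gate = sigmoid(logit).
logit = 0.0
gate = 1.0 / (1.0 + np.exp(-logit))
fused = rl_gated_fusion(visual, text, gate)

# Episodic REINFORCE step: in the paper the reward would come from
# episode classification performance; here it is a placeholder.
reward = 1.0
action = 1.0                            # e.g. "weight self-attention more"
grad_logit = reward * (action - gate)   # score-function gradient for a
logit += 0.1 * grad_logit               # Bernoulli-parameterized gate
```

In the described method, shallow layers would learn gates favoring local attribute grounding while deeper layers shift the gate toward global class descriptions; the sketch shows only the per-layer mechanics.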
Problem

Research questions and friction points this paper is trying to address.

Few-shot learning
Vision-language alignment
Semantic representation
Cross-modal fusion
Large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-shot learning
Vision-language alignment
Reinforcement learning gating
Dual-level semantics
Cross-modal fusion
Wenhao Li
Shandong University
Wireless sensing, mmWave radar, Side-channel analysis

Xianjing Meng
School of Computing and Artificial Intelligence, Shandong University of Finance and Economics

Qiangchang Wang
Shandong University
Computer Vision, Deep Learning

Zhongyi Han
Professor, Shandong University
Machine Learning, Agentic AI, AI for Science

Zhibin Wu
Software School, Shandong University

Yilong Yin
Software School, Shandong University