🤖 AI Summary
To address the challenges of training vision-language models under low-resource conditions with inefficient multimodal information fusion, this paper proposes a lightweight decoder-based architecture. The core innovation is a token-level dynamic gating mechanism that learns modality selection strategies without explicit supervision: content words draw more heavily on visual cues, while function words rely on linguistic cues, yielding both interpretability and flexibility. The method further integrates feature modulation, channel-wise attention, and an auxiliary contrastive objective to strengthen representation quality and visual grounding. Under the strict data and compute constraints of the BabyLM Challenge 2025 Vision track, the approach matches or surpasses multimodal baselines across five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground, and VQA), demonstrating its efficiency, generalizability, and practical applicability in resource-constrained settings.
📝 Abstract
Training vision-language models on cognitively plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information, and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
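The token-wise dynamic gate described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes a scalar sigmoid gate computed from the concatenation of a token's decoder hidden state and the global image embedding, with the fused representation being a gate-weighted convex combination of the two. All names, weights, and the tiny dimensionality are hypothetical.

```python
import math
import random

random.seed(0)

D = 4  # toy hidden size; real models use hundreds of dimensions


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical learned gate weights over the concatenated [token; image] vector.
w_gate = [random.uniform(-1.0, 1.0) for _ in range(2 * D)]


def fuse(token_vec, image_vec):
    """Token-wise dynamic gate: g near 1 leans on the visual cue,
    g near 0 leans on the linguistic cue. Returns (fused_vec, g)."""
    g = sigmoid(dot(w_gate, token_vec + image_vec))  # list concat -> scalar gate
    fused = [g * v + (1.0 - g) * t for t, v in zip(token_vec, image_vec)]
    return fused, g


# Toy inputs standing in for a decoder hidden state and a global image embedding.
token = [0.1, -0.2, 0.3, 0.05]
image = [0.5, 0.4, -0.1, 0.2]
fused, gate = fuse(token, image)
print(round(gate, 3), [round(x, 3) for x in fused])
```

In the paper's setting the gate is learned end-to-end with the language-modelling and contrastive losses; the interpretable pattern reported above would correspond to content-word tokens producing gates closer to 1 and function-word tokens gates closer to 0.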