Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of training vision-language models under low-resource conditions and of inefficient multimodal fusion, this paper proposes a lightweight decoder architecture. The core innovation is a token-wise dynamic gating mechanism that learns modality-selection strategies without explicit supervision: content words come to prefer visual cues while function words rely on linguistic cues, yielding both interpretability and flexibility. The method further integrates feature modulation, channel-wise attention, and auxiliary contrastive objectives to improve representation quality. Under strict limits on data and compute, the approach matches or surpasses state-of-the-art multimodal baselines across five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA), demonstrating its efficiency, generalisability, and practical applicability in resource-constrained settings.
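The token-wise gate described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: a scalar sigmoid gate conditioned on the concatenated token and image features, mixing a global image embedding into each token representation. The function names, shapes, and gate parameterisation are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_gate(h_text, v_img, W, b):
    """Mix a global image embedding into each token representation.

    h_text: (seq_len, d) token features; v_img: (d,) global image embedding.
    W: (2*d,) gate weights; b: scalar bias. All names are hypothetical.
    """
    v = np.broadcast_to(v_img, h_text.shape)  # repeat the image for each token
    # One scalar gate per token, conditioned on both modalities.
    g = sigmoid(np.concatenate([h_text, v], axis=-1) @ W + b)  # (seq_len,)
    g = g[:, None]
    # g near 1 -> visual cues dominate; g near 0 -> linguistic cues dominate.
    return g * v + (1 - g) * h_text
```

Because the gate is a per-token scalar, inspecting its value directly yields the interpretable content-word/function-word pattern the summary reports.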

📝 Abstract
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
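Contributions (2) of the abstract, feature modulation and channel attention, can be illustrated with a short sketch. This assumes a squeeze-and-excitation-style channel attention over the global image embedding and a FiLM-style scale-and-shift modulation of token features; both are common realisations of these ideas, and the specific forms here are assumptions rather than the paper's exact design.

```python
import numpy as np

def channel_attention(v_img, W1, W2):
    """Squeeze-and-excitation-style reweighting of image-embedding channels.

    Hypothetical shapes: v_img (d,), W1 (d, d//r), W2 (d//r, d),
    where r is a bottleneck reduction ratio.
    """
    s = np.maximum(v_img @ W1, 0.0)          # squeeze through a ReLU bottleneck
    a = 1.0 / (1.0 + np.exp(-(s @ W2)))      # per-channel attention in (0, 1)
    return a * v_img                          # emphasise informative channels

def film_modulate(h_text, gamma, beta):
    """FiLM-style feature modulation: scale and shift token features with
    image-conditioned parameters gamma, beta (both of shape (d,))."""
    return gamma * h_text + beta
```

Both operations are cheap (a few matrix-vector products), which is consistent with squeezing more signal out of a single global image embedding under tight compute budgets.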
Problem

Research questions and friction points this paper is trying to address.

Adaptively fusing linguistic and visual cues with token-wise dynamic gating
Maximizing utility of limited visual information under resource constraints
Achieving efficient multimodal learning with interpretability and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-wise dynamic gating for adaptive multimodal fusion
Feature modulation and channel attention for visual utility
Auxiliary contrastive objectives for visual grounding
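The auxiliary contrastive objective for visual grounding can be sketched as a symmetric InfoNCE loss over matched text-image pairs in a batch, as popularised by CLIP-style training. The temperature value and the exact pairing scheme here are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def info_nce(text_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE over matched (text, image) pairs.

    text_emb, img_emb: (batch, d); row i of each is a matched pair.
    tau is a hypothetical temperature hyperparameter.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / tau                  # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()             # matched pair on the diagonal

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched pairs together while pushing in-batch negatives apart gives the decoder a grounding signal beyond next-token prediction.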
Authors

Bianca-Mihaela Ganescu (ALTA Institute, Department of Computer Science & Technology, University of Cambridge)
Suchir Salhan (University of Cambridge)
Andrew Caines (University of Cambridge)
P. Buttery (ALTA Institute, Department of Computer Science & Technology, University of Cambridge)

Topics: Machine Learning, Language Models, Natural Language Processing, Linguistics, Cognitive Science