🤖 AI Summary
To address the challenges of training vision-language models under low-resource conditions with inefficient multimodal information fusion, this paper proposes a lightweight decoder-based architecture. The core innovation is a token-level dynamic gating mechanism that learns modality selection strategies without explicit supervision: content words draw more heavily on visual cues, while function words rely on linguistic cues, yielding both interpretability and flexibility. The method further integrates feature modulation, channel-wise attention, and an auxiliary contrastive objective to strengthen representation quality and visual grounding. Under the strict data and compute constraints of the BabyLM Challenge 2025 Vision track, the approach matches or surpasses multimodal baselines across five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground, and VQA), demonstrating its efficiency, generalizability, and practical applicability in resource-constrained settings.
📝 Abstract
Training vision-language models on cognitively plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information, and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
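The token-wise dynamic gate described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes a scalar sigmoid gate computed from the concatenation of a token's decoder hidden state and the global image embedding, with the fused representation being a gate-weighted convex combination of the two. All names, weights, and the tiny dimensionality are hypothetical.

```python
import math
import random

random.seed(0)

D = 4  # toy hidden size; real models use hundreds of dimensions


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical learned gate weights over the concatenated [token; image] vector.
w_gate = [random.uniform(-1.0, 1.0) for _ in range(2 * D)]


def fuse(token_vec, image_vec):
    """Token-wise dynamic gate: g near 1 leans on the visual cue,
    g near 0 leans on the linguistic cue. Returns (fused_vec, g)."""
    g = sigmoid(dot(w_gate, token_vec + image_vec))  # list concat -> scalar gate
    fused = [g * v + (1.0 - g) * t for t, v in zip(token_vec, image_vec)]
    return fused, g


# Toy inputs standing in for a decoder hidden state and a global image embedding.
token = [0.1, -0.2, 0.3, 0.05]
image = [0.5, 0.4, -0.1, 0.2]
fused, gate = fuse(token, image)
print(round(gate, 3), [round(x, 3) for x in fused])
```

In the paper's setting the gate is learned end-to-end with the language-modelling and contrastive losses; the interpretable pattern reported above would correspond to content-word tokens producing gates closer to 1 and function-word tokens gates closer to 0.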