🤖 AI Summary
To address the critical issue of information loss or redundancy caused by fixed-frame sampling in video captioning, this paper proposes the first model-agnostic dynamic module selection framework. Our method jointly adapts the number of visual tokens and the scale of generation modules via adaptive visual token subset construction and a learnable attention masking mechanism. Key contributions include: (1) decoupling module selection from the backbone model to enable plug-and-play integration; (2) introducing a lightweight gating network for token importance estimation and subset selection; and (3) designing an adaptive attention mask to enhance modeling of salient spatiotemporal regions. Evaluated on MSVD, MSR-VTT, and VATEX, our framework consistently improves BLEU-4 and CIDEr scores across three representative video captioning architectures—achieving average gains of +2.1 and +3.7, respectively—while reducing computational overhead by 18%–25%.
📝 Abstract
Multi-modal transformers are rapidly gaining attention in video captioning. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When too few frames are extracted, important frames carrying essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes many consecutive frames, introducing redundancy among the visual tokens extracted from them. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, with two main functions: (1) selecting a caption generation module of an appropriate size based on the visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Experiments on three benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
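The token-selection and masking pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a single linear layer with a sigmoid as the "lightweight gating network", top-k selection for subset construction, and an additive attention-logit bias as the "adaptive attention mask" — all of these specific choices (`gate_scores`, `select_subset`, `adaptive_attention_mask`, the `boost` factor) are hypothetical.

```python
import numpy as np

def gate_scores(tokens, w, b):
    # Hypothetical lightweight gate: one linear layer + sigmoid
    # produces a per-token importance score in (0, 1).
    return 1.0 / (1.0 + np.exp(-(tokens @ w + b)))

def select_subset(tokens, scores, k):
    # Subset construction: keep the k most important visual tokens,
    # preserving their original temporal order.
    idx = np.sort(np.argsort(-scores)[:k])
    return tokens[idx], idx

def adaptive_attention_mask(scores, idx, boost=2.0):
    # Additive bias on attention logits: selected tokens get log(boost),
    # so their attention weight is multiplied by `boost` after softmax.
    mask = np.zeros_like(scores)
    mask[idx] = np.log(boost)
    return mask

rng = np.random.default_rng(0)
T, d = 16, 8                       # 16 visual tokens of dimension 8
tokens = rng.normal(size=(T, d))
w = rng.normal(size=d)

scores = gate_scores(tokens, w, b=0.0)
subset, idx = select_subset(tokens, scores, k=6)
mask = adaptive_attention_mask(scores, idx)
```

In a real system the subset size `k` (and thereby the chosen caption generation module) would itself be decided from the gate scores rather than fixed, and the mask would be added to the attention logits inside the transformer layers.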