Dual-Stream Collaborative Transformer for Image Captioning

📅 2026-01-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing region-based image captioning methods, which often generate irrelevant content due to insufficient contextual information and excessive reliance on previously generated text. To mitigate semantic inconsistency and spatial misalignment, the authors propose a dual-stream collaborative Transformer architecture that dynamically integrates region and segmentation features for the first time. The approach introduces a Pattern-Specific Mutual Attention Encoder (PSMAE) to enable cross-modal feature co-modeling and designs a Dynamic Nomination Decoder (DND) for adaptive feature selection. Evaluated on standard image captioning benchmarks, the model significantly outperforms current state-of-the-art methods, producing more accurate and descriptive captions.

๐Ÿ“ Abstract
Current region-feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they remain prone to generating irrelevant descriptions due to the lack of contextual information and over-reliance on the partial descriptions already generated when predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide caption generation. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of the two representations by having them query each other. The DND dynamically searches for the learning blocks most relevant to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive captions. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way so as to bypass their semantic inconsistency and spatial misalignment issues in image captioning. Experimental results on popular benchmark datasets demonstrate that our DSCT outperforms state-of-the-art image captioning models in the literature.
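The core PSMAE idea, two feature streams querying each other, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it omits the learned projections, multi-head structure, layer normalization, and feed-forward sublayers of a real Transformer encoder, and the function names (`attention`, `mutual_attention`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mutual_attention(region_feats, seg_feats):
    """Mutual (cross-stream) attention: each stream queries the other,
    loosely mirroring the PSMAE's cross-querying of region and
    segmentation features (simplified sketch)."""
    region_out = attention(region_feats, seg_feats, seg_feats)
    seg_out = attention(seg_feats, region_feats, region_feats)
    return region_out, seg_out

# Toy example: 2 region vectors and 3 segmentation vectors, dim 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
segments = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
r_out, s_out = mutual_attention(regions, segments)
print(len(r_out), len(s_out))  # prints: 2 3
```

Note that each stream's output keeps its own sequence length (2 region rows, 3 segmentation rows) while its content is re-expressed in terms of the other stream, which is what lets the decoder later fuse the two consolidated representations.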
Problem

Research questions and friction points this paper is trying to address.

image captioning
region features
contextual information
semantic inconsistency
spatial misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Stream Collaborative Transformer
Segmentation Feature
Pattern-Specific Mutual Attention
Dynamic Nomination Decoder
Feature Fusion