🤖 AI Summary
This work addresses the problem of poor text coherence in image-to-text generation, arising from insufficient alignment between visual and linguistic elements. To this end, we propose the Adaptive Clustering Fusion (ACF) model. Methodologically, ACF introduces a soft, learnable clustering window into the self-attention mechanism, enabling layer-wise hierarchical clustering of visual patches and text tokens, thereby implicitly establishing object-region–phrase alignment. The window is governed by a data-driven dynamic clustering matrix computed from the current input; building both the visual encoder and the language decoder from ACF layers embeds cross-modal hierarchical knowledge in a parse-tree structure. Evaluated on image captioning and visual question answering, ACF substantially outperforms most state-of-the-art methods and achieves performance comparable to certain large-scale pretrained models. These results validate the effectiveness and generalizability of implicit hierarchical alignment for multimodal representation learning.
📝 Abstract
We propose a novel Transformer-based image-to-text generation model, termed ACF, that adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments for better visual-text coherence. To achieve this, we design a novel self-attention layer that applies self-attention over the elements in a local cluster window instead of the whole sequence. The window size is softly decided by a clustering matrix computed from the current input data, so the process is adaptive. By stacking these revised self-attention layers to construct ACF, the small clusters in lower layers can be grouped into bigger ones in higher layers; e.g., in the vision/language domain, ACF clusters small objects/phrases into bigger ones. In this gradual clustering process, a parsing tree is generated that embeds the hierarchical knowledge of the input sequence. As a result, by using ACF to build both the vision encoder and the language decoder, hierarchical object-phrase alignments are embedded and then transferred from the vision to the language domain in two popular image-to-text tasks: image captioning and visual question answering (VQA). The experimental results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models and achieves scores comparable to some large-scale pre-trained models. Our code is available at https://github.com/ZihuaEvan/ACFModel/.
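To make the mechanism concrete, below is a minimal sketch in PyTorch of self-attention gated by a soft, input-dependent clustering matrix, the core idea the abstract describes. The class name, the num_clusters parameter, and the log-gating scheme are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of cluster-windowed self-attention (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterWindowAttention(nn.Module):
    """Self-attention restricted to soft, data-dependent cluster windows.

    A clustering matrix C of shape (N, K) is predicted from the input;
    C @ C.T gives a soft same-cluster affinity between tokens, which is
    used as an additive log-mask so tokens attend mainly within their
    cluster instead of over the whole sequence.
    """
    def __init__(self, dim, num_heads=8, num_clusters=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Data-driven cluster assignment: one logit per cluster per token.
        self.to_cluster = nn.Linear(dim, num_clusters)

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Soft cluster assignments computed from the current input (adaptive).
        assign = F.softmax(self.to_cluster(x), dim=-1)    # (B, N, K)
        window = assign @ assign.transpose(1, 2)          # (B, N, N) soft window
        mask = torch.log(window + 1e-6).unsqueeze(1)      # (B, 1, N, N) log-gate

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + mask
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

Stacking such layers with a decreasing number of clusters per layer would merge small clusters in lower layers into larger ones in higher layers, yielding the hierarchical, parse-tree-like grouping of patches into object regions and words into phrases that the abstract describes.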