🤖 AI Summary
Current AI models, particularly vision-language models (VLMs), show significant limitations in compositional generalization, and earlier attempts to address this with compositional tensor-based sentence semantics led to negative results.
Method: This work applies variational quantum circuits (VQCs) to an image captioning task that requires compositional generalization, interpreting the representations of compositional tensor-based models in Hilbert space. Images are encoded in two ways: a multi-hot encoding (MHE) of binary image vectors, and an angle/amplitude encoding of image vectors taken from CLIP. The authors conjecture that the greater training efficiency of quantum models will improve the learning of semantic composition.
Contribution/Results: Experiments give good proof-of-concept results with noisy multi-hot encodings. Results on CLIP image vectors are more mixed, but the quantum models still outperform classical compositional models. The study offers an empirically tested, quantum machine learning approach to compositional generalization.
📝 Abstract
Compositional generalization is a key facet of human cognition, but it is lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We use two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Results on CLIP image vectors were more mixed, but the quantum models still outperformed classical compositional models.
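To make the encodings named above concrete, the following is a minimal NumPy sketch of how a classical vector can be turned into a quantum state vector. It is a classical simulation under stated assumptions, not the authors' pipeline: the function names are hypothetical, the angle encoding assumes one RY rotation per feature, and the treatment of a multi-hot vector as a computational basis state is one plausible reading of MHE rather than a detail confirmed by the abstract.

```python
import numpy as np


def amplitude_encode(x):
    """Store a real vector in the amplitudes of a state, zero-padded to length 2**n."""
    x = np.asarray(x, dtype=float)
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    padded = np.zeros(2 ** n_qubits)
    padded[: len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return padded / norm  # a valid state must have unit norm


def angle_encode(x):
    """One qubit per feature: RY(x_i)|0> = [cos(x_i/2), sin(x_i/2)], joined by tensor product."""
    state = np.array([1.0])
    for theta in np.asarray(x, dtype=float):
        qubit = np.array([np.cos(theta / 2), np.sin(theta / 2)])
        state = np.kron(state, qubit)  # first feature becomes the most significant qubit
    return state


def multi_hot_encode(bits):
    """ASSUMPTION: map a binary vector b to the basis state |b_1 b_2 ... b_n>."""
    bits = [int(b) for b in bits]
    index = int("".join(str(b) for b in bits), 2)
    state = np.zeros(2 ** len(bits))
    state[index] = 1.0
    return state
```

Amplitude encoding needs only about log2(d) qubits for a d-dimensional vector, which is why it suits long embeddings such as CLIP image vectors, while angle encoding uses one qubit per feature but requires only single-qubit rotations to prepare.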