Compositional Concept Generalization with Variational Quantum Circuits

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Problem: Current AI models, particularly vision-language models (VLMs), show significant limitations in compositional generalization, and earlier attempts to address this with compositional tensor-based sentence semantics produced negative results. Method: This work interprets the representations of compositional tensor-based models in Hilbert space and trains variational quantum circuits (VQCs) to learn them on an image captioning task that requires compositional generalization, on the conjecture that the increased training efficiency of quantum models will improve performance. Images are encoded in two ways: a multi-hot encoding (MHE) on binary image vectors, and an angle/amplitude encoding on image vectors taken from CLIP. Contribution/Results: With noisy MHE inputs, the model achieves good proof-of-concept results. With CLIP image vectors the results are more mixed, but the model still outperforms classical compositional models. The study offers an empirically tested, quantum-based approach to compositional generalization.


📝 Abstract

Compositional generalization is a key facet of human cognition, but is lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We use two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
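
The abstract names three encodings without spelling them out, so the following is only a minimal sketch of their textbook definitions, simulated with NumPy state vectors. It is not the authors' implementation: the function names (`multi_hot`, `amplitude_encode`, `angle_encode`), the choice of RY rotations for angle encoding, and the zero-padding to a power-of-two length are all assumptions made for illustration.

```python
import numpy as np


def multi_hot(active, size):
    """Binary vector of length `size` with ones at the `active` indices."""
    v = np.zeros(size)
    v[np.array(list(active), dtype=int)] = 1.0
    return v


def amplitude_encode(x):
    """Use a real feature vector as the amplitudes of a quantum state.

    Assumption: the vector is zero-padded to the next power-of-two length
    and L2-normalized, so d features need ceil(log2(d)) qubits.
    """
    x = np.asarray(x, dtype=float)
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    padded = np.zeros(2 ** n_qubits)
    padded[: len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return padded / norm


def angle_encode(x):
    """Use each feature as a rotation angle on its own qubit.

    Assumption: RY rotations, so qubit i is RY(x_i)|0> =
    [cos(x_i / 2), sin(x_i / 2)]. The full state is the tensor product,
    which needs one qubit per feature.
    """
    state = np.array([1.0])
    for theta in np.asarray(x, dtype=float):
        qubit = np.array([np.cos(theta / 2), np.sin(theta / 2)])
        state = np.kron(state, qubit)
    return state


if __name__ == "__main__":
    image = multi_hot([0, 2], size=4)        # hypothetical 4-feature binary image
    print(angle_encode(np.pi * image))       # 4 qubits, 16 amplitudes
    print(amplitude_encode([3.0, 4.0]))      # 1 qubit, 2 amplitudes
```

Under these assumptions the trade-off between the two quantum encodings is visible in the qubit counts: angle encoding uses one qubit per feature, while amplitude encoding packs d features into roughly log2(d) qubits. Scaling a multi-hot vector by π before angle encoding maps it to a single computational basis state, which is one simple way a binary image vector could be loaded into a circuit.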

Problem

Research questions and friction points this paper is trying to address.

Addressing compositional generalization in vision-language models

Training variational quantum circuits for image captioning tasks

Evaluating quantum models against classical compositional approaches

Innovation

Methods, ideas, or system contributions that make the work stand out.

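Variational quantum circuits trained to learn compositional representations

Tensor-based compositional representations interpreted in Hilbert space

Two image encodings: multi-hot encoding (MHE) on binary image vectors and angle/amplitude encoding on CLIP image vectors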
