🤖 AI Summary
Modeling structural information in multimodal (text + image) data remains challenging because classical high-order tensor computations are inefficient and large language models lack explicit structural reasoning capability. Method: This paper proposes the first structure-aware quantum multimodal processing framework. It introduces (1) a novel quantum-circuit translation mechanism that integrates type theory with homomorphic mappings to enable verifiable encoding of syntactic/grammatical and visual hierarchical structures; and (2) a fully structured variational quantum architecture that sidesteps classical tensor-training bottlenecks. Results: On the SVO Probes image classification task, the best model matches state-of-the-art classical performance while enabling, for the first time, end-to-end interpretable structured reasoning. This work establishes a new paradigm for joint quantum-structural modeling in multimodal AI.
📝 Abstract
While large language models (LLMs) have advanced the field of natural language processing (NLP), their "black box" nature obscures their decision-making processes. To address this, researchers have developed structured approaches based on higher-order tensors. These can model linguistic relations, but training stalls on classical computers because the tensors grow too large. Tensors are natural inhabitants of quantum systems, and training on quantum computers offers a solution by translating text into variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware processing of multimodal text+image data. Here, "structure" refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a mainstream image classification task (SVO Probes), our best model performed on par with state-of-the-art classical models; moreover, it was fully structured.
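The training loop behind a variational quantum circuit can be illustrated with a toy example. The sketch below is not the paper's MultiQ-NLP pipeline; it is a minimal, self-contained simulation of a single-qubit circuit with one parameterized rotation gate, RY(θ), whose angle is tuned by gradient descent so the measurement statistics match a target, the same optimize-the-circuit-parameters idea that variational architectures scale up.

```python
import math

# Toy variational circuit: one qubit, one RY(theta) rotation gate.
# Applying RY(theta) to |0> yields cos(theta/2)|0> + sin(theta/2)|1>,
# so the probability of measuring |1> is sin^2(theta/2).
def prob_one(theta: float) -> float:
    return math.sin(theta / 2.0) ** 2

def train(target: float, theta: float = 0.1, lr: float = 0.5,
          steps: int = 200) -> float:
    """Minimize (prob_one(theta) - target)^2 by gradient descent
    on the circuit parameter theta."""
    for _ in range(steps):
        p = prob_one(theta)
        # d/d theta of sin^2(theta/2) = 0.5 * sin(theta)
        grad = 2.0 * (p - target) * 0.5 * math.sin(theta)
        theta -= lr * grad
    return theta

theta = train(target=0.9)
print(prob_one(theta))  # close to the target 0.9
```

In a real structure-aware pipeline the circuit would have many qubits and gates arranged according to the parsed structure of the input, but the optimization principle, adjusting gate parameters to minimize a classical loss, is the same.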