🤖 AI Summary
This work addresses visual question answering (VQA) over diagrammatic graph structures—such as metro network maps—where existing VQA methods struggle with graph-structured semantics and logical spatial reasoning. We propose an end-to-end interpretable neuro-symbolic approach that synergistically integrates Answer-Set Programming (ASP) with large language models (LLMs), decoupling multimodal perception (via optical graph recognition and pre-trained OCR, specifically PaddleOCR) from symbolic logical inference. Our contributions are threefold: (1) the first curated VQA dataset featuring metro-style graph-structured images; (2) a modular, fine-tuning-free neuro-symbolic architecture enabling seamless interoperability between pre-trained vision-language models and ASP solvers; and (3) state-of-the-art performance on our benchmark—73% average accuracy—demonstrating strong efficacy and intrinsic interpretability for complex spatial logical reasoning tasks.
📝 Abstract
Visual Question Answering (VQA) is a challenging problem that requires processing multimodal input. Answer-Set Programming (ASP) has shown great potential for adding interpretability and explainability to modular VQA architectures. In this work, we address the problem of how to integrate ASP with modules for vision and natural language processing to solve a new and demanding VQA variant concerned with images of graphs (rather than graphs in symbolic form). Images containing graph-based structures are a ubiquitous and popular form of visualisation. Here, we deal with the particular problem of graphs inspired by transit networks, and we introduce a novel dataset that amends an existing one by adding images of graphs that resemble metro lines. Our modular neuro-symbolic approach combines optical graph recognition for graph parsing, a pretrained optical character recognition neural network for parsing labels, Large Language Models (LLMs) for language processing, and ASP for reasoning. This method serves as a first baseline and achieves an overall average accuracy of 73% on the dataset. Our evaluation provides further evidence of the potential of modular neuro-symbolic systems to solve complex VQA tasks, in particular systems that combine pretrained models, without any further training, with logic programming for reasoning.
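To make the reasoning step concrete, here is a minimal, hypothetical sketch of the kind of question the symbolic module must answer once the graph has been parsed from the image: given a toy metro network as an adjacency structure, compute the minimum number of stops between two stations. This is plain Python with breadth-first search rather than ASP, and the station names are invented; it only illustrates the class of spatial reasoning involved, not the paper's actual pipeline or dataset.

```python
from collections import deque

# Toy metro network: station -> neighbouring stations (undirected edges).
# All names here are invented for illustration only.
metro = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B", "E"],
    "E": ["C", "D"],
}

def stops_between(graph, start, goal):
    """Breadth-first search: minimum number of edges (stops) from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        station, dist = queue.popleft()
        if station == goal:
            return dist
        for nxt in graph[station]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable from start

print(stops_between(metro, "A", "E"))  # -> 3 (e.g. A-B-C-E)
```

In the actual architecture described above, such queries would instead be encoded declaratively as ASP rules, which makes the derivation of each answer inspectable and explains the system's intrinsic interpretability.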