Visual Graph Question Answering with ASP and LLMs for Language Parsing

📅 2025-02-11
🏛️ Electronic Proceedings in Theoretical Computer Science
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses visual question answering (VQA) over diagrammatic graph structures—such as metro network maps—where existing VQA methods struggle with graph-structured semantics and logical spatial reasoning. We propose an end-to-end interpretable neuro-symbolic approach that synergistically integrates Answer-Set Programming (ASP) with large language models (LLMs), decoupling multimodal perception (via optical graph recognition and pre-trained OCR, specifically PaddleOCR) from symbolic logical inference. Our contributions are threefold: (1) the first curated VQA dataset featuring metro-style graph-structured images; (2) a modular, fine-tuning-free neuro-symbolic architecture enabling seamless interoperability between pre-trained vision-language models and ASP solvers; and (3) state-of-the-art performance on our benchmark—73% average accuracy—demonstrating strong efficacy and intrinsic interpretability for complex spatial logical reasoning tasks.

📝 Abstract
Visual Question Answering (VQA) is a challenging problem that requires processing multimodal input. Answer-Set Programming (ASP) has shown great potential for adding interpretability and explainability to modular VQA architectures. In this work, we address the problem of how to integrate ASP with modules for vision and natural language processing to solve a new and demanding VQA variant that is concerned with images of graphs (not graphs in symbolic form). Images containing graph-based structures are a ubiquitous and popular form of visualisation. Here, we deal with the particular problem of graphs inspired by transit networks, and we introduce a novel dataset that amends an existing one by adding images of graphs that resemble metro lines. Our modular neuro-symbolic approach combines optical graph recognition for graph parsing, a pretrained optical character recognition neural network for parsing labels, Large Language Models (LLMs) for language processing, and ASP for reasoning. This method serves as a first baseline and achieves an overall average accuracy of 73% on the dataset. Our evaluation provides further evidence of the potential of modular neuro-symbolic systems, in particular systems combining pretrained models that require no further training with logic programming for reasoning, to solve complex VQA tasks.
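To make the reasoning stage concrete: once the perception modules have turned a metro-map image into symbolic edge facts, questions such as "how many stops lie between two stations?" reduce to graph search over those facts. The paper encodes this in ASP; the following is only a minimal Python analogue under assumed names (`EDGES`, `stops_between` are illustrative, not the paper's interface) showing the kind of inference the solver performs.

```python
from collections import deque

# Hypothetical toy input: edge facts of the sort the perception
# stage (optical graph recognition + OCR) might extract from a
# metro-style image. Station and line names are made up.
EDGES = [
    ("alpha", "beta", "line1"),
    ("beta", "gamma", "line1"),
    ("gamma", "delta", "line2"),
    ("beta", "delta", "line2"),
]

def neighbours(station):
    """Stations directly connected to `station` on any line."""
    out = set()
    for a, b, _line in EDGES:
        if a == station:
            out.add(b)
        elif b == station:
            out.add(a)
    return out

def stops_between(src, dst):
    """Minimum number of intermediate stops from src to dst,
    found by breadth-first search; None if not connected."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        station, dist = queue.popleft()
        if station == dst:
            return max(dist - 1, 0)
        for nxt in neighbours(station):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In the actual system, an ASP solver answers such queries declaratively from the same facts, which is what gives the pipeline its interpretability: every answer is derivable from an explicit symbolic encoding.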
Problem

Research questions and friction points this paper is trying to address.

Integrate ASP with vision and NLP modules
Solve VQA for graph images in transit networks
Use neuro-symbolic approach for complex VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

ASP for reasoning interpretability
LLMs for language processing
Optical graph recognition parsing
Jakob Johannes Bauer
ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland
Thomas Eiter
Vienna University of Technology (TU Wien)
knowledge representation and reasoning, declarative problem solving, artificial intelligence, computational logic
N. Ruiz
Vienna University of Technology (TU Wien), Favoritenstrasse 9–11, Vienna, 1040, Austria
J. Oetsch
Jönköping University, Gjuterigatan 5, 55111 Jönköping, Sweden