Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

📅 2025-11-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) achieve high prediction accuracy but suffer from poor interpretability and are prone to factual hallucinations on out-of-distribution data. Existing neurosymbolic approaches extract symbols solely from task labels, lacking visual grounding and thus yielding semantically shallow symbolic representations. To address these limitations, we propose a multi-agent neurosymbolic system: a vision concept generator automatically discovers interpretable visual concepts directly from raw images; a large language model–based reasoning agent performs symbolic composition and generates first-order logic rules; and a vision verification agent enables end-to-end visual grounding of symbols, mitigating label bias. Evaluated on five benchmarks, our method achieves an average performance gain of 5%, reduces the incidence of hallucinated symbols by up to 50%, and significantly enhances decision transparency and factual consistency.

πŸ“ Abstract
Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into why a decision is reached, frequently hallucinating facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi-agent system, Concept-RuleNet, that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. These visual concepts then condition symbol discovery, anchoring the generated symbols in real image statistics and mitigating label bias. Next, a large language model reasoner agent composes the symbols into executable first-order rules, yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree to which each symbol is present and triggers rule execution in tandem with the outputs of black-box neural models, producing predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system outperforms state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
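The inference step described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the `Rule` structure, the threshold value, and the toy symbol scores are all our assumptions; in the actual system the scores would come from the vision verifier agent and the rules from the LLM reasoner.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    label: str
    symbols: list  # conjunction of visual symbols required by the rule

def execute_rules(symbol_scores, rules, threshold=0.5):
    """Fire each first-order rule whose symbols are all sufficiently
    present, given the verifier's degree-of-presence scores in [0, 1]."""
    fired = []
    for rule in rules:
        if all(symbol_scores.get(s, 0.0) >= threshold for s in rule.symbols):
            fired.append(rule.label)
    return fired

# Toy verifier output and toy rules (illustrative names only).
scores = {"striped_fur": 0.9, "whiskers": 0.8, "mane": 0.1}
rules = [
    Rule("tiger", ["striped_fur", "whiskers"]),
    Rule("lion", ["mane", "whiskers"]),
]
print(execute_rules(scores, rules))  # -> ['tiger']
```

In the full system, the fired rules would be combined with the black-box neural model's output, so each prediction carries an explicit reasoning pathway.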
Problem

Research questions and friction points this paper is trying to address.

Addresses VLMs' lack of explainability and hallucination issues in decisions
Enhances neurosymbolic reasoning by grounding symbols in visual data
Improves interpretability and reduces hallucinated symbols in rule-based systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal concept generator mines visual concepts from images
LLM reasoner composes symbols into first-order logic rules
Vision verifier quantifies symbol presence and triggers rule execution
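As a rough sketch of the middle step, the LLM reasoner's output can be treated as textual first-order rules over the mined concepts, which the system then parses into an executable form. The prompt wording and the `label :- c1 AND c2` rule syntax are assumptions for illustration; the paper does not specify this exact format.

```python
import re

def build_rule_prompt(label, concepts):
    # Hypothetical prompt for the LLM reasoner agent: constrains the rule
    # to use only visually grounded concepts mined from training images.
    return (
        f"Using only these visual concepts: {', '.join(concepts)}, "
        f"write a rule of the form '{label} :- c1 AND c2' "
        f"that identifies '{label}'."
    )

def parse_rule(text):
    """Parse a rule string 'label :- a AND b' into (label, [symbols])."""
    head, body = text.split(":-")
    symbols = [s.strip() for s in re.split(r"\bAND\b", body)]
    return head.strip(), symbols

print(parse_rule("tiger :- striped_fur AND whiskers"))
# -> ('tiger', ['striped_fur', 'whiskers'])
```

Parsed rules of this kind are what the vision verifier can then ground at inference time, checking each symbol against the image before the rule fires.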