Neuro-Symbolic Spatial Reasoning in Segmentation

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary semantic segmentation (OVSS) suffers from insufficient spatial relationship modeling for unseen categories, hindering vision-language models’ ability to accurately align local image patches with novel class concepts. To address this, we propose a neuro-symbolic spatial reasoning framework—the first to incorporate differentiable first-order logic into OVSS—explicitly encoding object-level spatial relations (e.g., “left-adjacent”, “contains”). Our method integrates a lightweight, end-to-end trainable spatial relation constraint module into vision-language models via pseudo-class generation and fuzzy logic relaxation. This enables joint optimization of pixel-wise semantic prediction and symbolic spatial reasoning. Crucially, our approach introduces only a single auxiliary loss term and adds no extra parameters. Evaluated on four standard benchmarks, it achieves state-of-the-art average mIoU, with particularly pronounced gains in multi-category, complex scenes.
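The fuzzy logic relaxation mentioned above can be illustrated with a minimal sketch. Assuming product t-norm semantics (the summary does not specify which relaxation the paper uses, and these function names are illustrative), a hard first-order constraint such as cat(x) → right-of-person(x) becomes a differentiable penalty on predicted probabilities:

```python
def fuzzy_not(a):
    return 1.0 - a

def fuzzy_and(a, b):
    # product t-norm
    return a * b

def fuzzy_or(a, b):
    # probabilistic sum (dual of the product t-norm)
    return a + b - a * b

def fuzzy_implies(a, b):
    # a -> b  rewritten as  (not a) or b
    return fuzzy_or(fuzzy_not(a), b)

def constraint_loss(p_cat, p_right_of_person):
    # Penalize violations of: cat(x) -> right-of-person(x).
    # Loss is 0 when the implication holds with truth value 1.
    return 1.0 - fuzzy_implies(p_cat, p_right_of_person)
```

Because every operation is smooth in the input probabilities, such a penalty can be added to a segmentation loss and minimized by ordinary backpropagation, which is what makes the constraint end-to-end trainable.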


📝 Abstract
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first-order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatially consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and shows particularly clear advantages on images containing multiple categories, at the cost of only a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
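The "single auxiliary loss function" in the abstract suggests a joint objective of roughly the following shape. This is a hypothetical sketch, not the paper's implementation: the function names and the weighting factor `lam` are assumptions.

```python
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the target class for one pixel;
    # small epsilon guards against log(0).
    return -math.log(probs[target_idx] + 1e-9)

def joint_loss(sem_probs, pseudo_probs, sem_target, pseudo_target, lam=0.1):
    # Main pixel-wise semantic loss plus one auxiliary loss on the
    # spatial pseudo category -- no additional network parameters,
    # only an extra term in the objective. `lam` is an assumed weight.
    return (cross_entropy(sem_probs, sem_target)
            + lam * cross_entropy(pseudo_probs, pseudo_target))
```

The key design point reflected here is that the spatial constraint enters only through the loss: both predictions can come from the same head over an extended label space, so the model stays parameter-free with respect to the base VLM segmentor.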
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary segmentation struggles with unseen object spatial relations
Vision-language models lack spatial reasoning for object relationships
Existing correlation-based methods cannot enforce explicit spatial constraints expressed as logic formulas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-symbolic spatial reasoning for open-vocabulary segmentation
First-order logic constraints encoded in neural architecture
Simultaneous prediction of semantic and spatial pseudo categories
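The relation triples and spatial pseudo categories listed above can be sketched as simple data structures; the triple fields and the naming scheme for derived pseudo classes are illustrative assumptions, not the paper's actual encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpatialRelation:
    # One extracted relation, e.g. <cat, to-right-of, person>
    subject: str
    predicate: str
    obj: str

def pseudo_category(rel):
    # Map a relation to (semantic class, spatial pseudo class):
    # pixels of rel.subject should also carry the pseudo label
    # "<predicate>_<object>", e.g. "to-right-of_person".
    return rel.subject, f"{rel.predicate}_{rel.obj}"
```

A pixel then receives two training targets, the semantic class and the pseudo class, which is what lets a single auxiliary loss tie the two predictions together.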