🤖 AI Summary
Modern vision-language models (VLMs) exhibit limited compositional reasoning capability, struggling to model complex interactions among multiple objects, attributes, and relations in images. To address this, we propose COCO-Tree, a framework that leverages large language models (LLMs) to generate structured, hierarchical neurosymbolic concept trees and employs a beam-search-inspired procedure to produce interpretable reasoning paths. COCO-Tree semantically enriches and logically calibrates VLM outputs without modifying the underlying VLM architecture, thereby improving compositional generalization. Evaluated on four standard compositionality benchmarks, COCO-Tree yields accuracy gains of 5-10% over baselines across seven open-source VLMs of varying sizes, improving both reasoning accuracy and decision transparency. Our core contribution is an interpretable, architecture-agnostic integration of LLMs' linguistic reasoning capacity with VLMs' perceptual capabilities.
📝 Abstract
Compositional reasoning remains a persistent weakness of modern vision-language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Prior work has attempted to improve compositionality through techniques such as better prompt structuring and chain-of-thought reasoning. A more recent line of work imparts additional reasoning to VLMs using well-trained Large Language Models (LLMs), whose far superior linguistic understanding compensates for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present 'COCO-Tree', a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLMs' linguistic reasoning. COCO-Tree's beam-search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks (Winoground, EqBench, ColorSwap, and SugarCrepe) across seven open-source VLMs of varying sizes demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
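To make the beam-search-inspired reasoning process concrete, the following is a minimal, hypothetical sketch of beam search over an LLM-generated concept tree. The tree structure, node format, and scoring function are illustrative assumptions (a real scorer would be a VLM's image-text alignment score), not the paper's actual implementation.

```python
# Hypothetical sketch: beam search over a concept tree.
# A node is {"concept": str, "children": [node, ...]}; `score` maps a
# list of concepts (a partial reasoning path) to a float, e.g. a VLM's
# image-text alignment score. All names here are illustrative.

def beam_search_tree(root, score, beam_width=2):
    """Return the highest-scoring root-to-leaf concept path."""
    beams = [([root["concept"]], root)]
    while any(node["children"] for _, node in beams):
        candidates = []
        for path, node in beams:
            if not node["children"]:           # keep already-finished paths
                candidates.append((path, node))
                continue
            for child in node["children"]:     # expand each live beam
                candidates.append((path + [child["concept"]], child))
        # prune to the top-k partial paths by score
        candidates.sort(key=lambda pc: score(pc[0]), reverse=True)
        beams = candidates[:beam_width]
    best_path, _ = max(beams, key=lambda pc: score(pc[0]))
    return best_path

# Toy usage: a tiny concept tree and a dummy scorer that prefers
# paths mentioning "cat chasing" (standing in for a VLM score).
tree = {"concept": "scene", "children": [
    {"concept": "dog chasing cat", "children": []},
    {"concept": "cat chasing dog", "children": [
        {"concept": "cat is gray", "children": []}]},
]}
toy_score = lambda path: sum("cat chasing" in c for c in path)
print(beam_search_tree(tree, toy_score))
# → ['scene', 'cat chasing dog', 'cat is gray']
```

The surviving path doubles as the rationale behind the prediction, which is how a tree search can provide interpretability that a single forward pass does not.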