Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs

📅 2025-09-10
🤖 AI Summary
This work examines whether structured, tree-based reasoning can improve the zero-shot classification performance of vision-language models (VLMs) on fine-grained tasks and large hierarchical label spaces. We propose a framework that decomposes classification into interpretable multi-level decision paths and combines LLM-generated semantic category descriptions with image descriptions to improve alignment. On GTSRB (fine-grained) and CIFAR-10 (coarse-grained), the model achieves 98.2% accuracy in understanding the tree-structured knowledge, yet tree-based reasoning consistently underperforms standard zero-shot prompting. Adding image descriptions improves both the tree-based and the zero-shot methods. To our knowledge, this is the first systematic investigation of decision-tree-guided structured reasoning for VLMs, revealing both its promise and its limitations in interpretability, semantic alignment, and hierarchical generalization.

📝 Abstract
Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluate it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added descriptions improve the performance of both the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
Problem

Research questions and friction points this paper is trying to address.

Assessing tree-based reasoning in vision language models
Evaluating structured reasoning on fine-grained visual classification
Investigating interpretable decision decomposition for VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based reasoning framework for VLMs
LLM-generated classes and descriptions
Evaluation on fine-grained and coarse-grained datasets
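
The idea of decomposing classification into a path of decisions can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` structure, the `classify_with_tree` function, the toy tree, and the `fake_vlm` stub are all hypothetical names introduced here, and a real system would replace the stub with an actual VLM query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a VLM call: (image, question, options) -> chosen option.
AskFn = Callable[[object, str, List[str]], str]


@dataclass
class Node:
    """A decision-tree node: a question with answer branches, or a leaf label."""
    question: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    label: Optional[str] = None


def classify_with_tree(image, root: Node, ask: AskFn) -> Tuple[str, List[Tuple[str, str]]]:
    """Walk the tree from the root, asking one question per level.

    Returns the predicted label and the decision path that led to it.
    """
    node, path = root, []
    while node.label is None:
        options = list(node.children)
        answer = ask(image, node.question, options)
        if answer not in node.children:
            raise ValueError(f"unexpected answer {answer!r} for {node.question!r}")
        path.append((node.question, answer))
        node = node.children[answer]
    return node.label, path


# Toy tree loosely inspired by traffic-sign categories (illustrative, not from the paper).
TOY_TREE = Node(
    question="What is the shape of the sign?",
    children={
        "circle": Node(
            question="What is the border color?",
            children={
                "red": Node(label="speed limit"),
                "blue": Node(label="mandatory direction"),
            },
        ),
        "triangle": Node(label="warning"),
    },
)


def fake_vlm(image, question, options):
    """Deterministic stub: 'image' is a dict of precomputed answers, not pixels."""
    return image[question]


# Usage: the returned path is what makes each prediction inspectable.
image = {"What is the shape of the sign?": "circle", "What is the border color?": "red"}
label, path = classify_with_tree(image, TOY_TREE, fake_vlm)
print(label, path)
```

Each prediction comes with the sequence of question/answer pairs that produced it, which is the interpretability benefit the framework aims for. An error at any level sends the traversal down the wrong branch, which is one plausible way a tree-based approach could fall behind a single zero-shot prompt.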
Sary Elmansoury
Department of Data Science and its Applications, German Research Centre for Artificial Intelligence (DFKI)
Islam Mesabah
Department of Data Science and its Applications, German Research Centre for Artificial Intelligence (DFKI)
Gerrit Großmann
Department of Data Science and its Applications, German Research Centre for Artificial Intelligence (DFKI)
Peter Neigel
German Research Center for Artificial Intelligence
Raj Bhalwankar
Department of Data Science and its Applications, German Research Centre for Artificial Intelligence (DFKI)
Daniel Kondermann
Quality Match GmbH
Dataset Quality, Performance Analysis, Computer Vision, Optical Flow, Stereo
Sebastian J. Vollmer
University of Kaiserslautern/DFKI