Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs

📅 2025-09-10

🤖 AI Summary

This work examines the zero-shot classification performance of vision-language models (VLMs) on fine-grained tasks and large hierarchical label spaces, a setting that remains understudied. We propose a structured, tree-based reasoning framework that decomposes classification into interpretable, multi-level decision paths, and we augment the tree prompts with large language model (LLM)-generated class and image descriptions to improve alignment. We evaluate the framework on GTSRB (fine-grained) and CIFAR-10 (coarse-grained). The model reaches 98.2% accuracy in understanding the tree knowledge, yet tree-based reasoning consistently underperforms standard zero-shot prompting. Adding image descriptions improves both the tree-based and the zero-shot methods. To our knowledge, this is the first systematic investigation of decision-tree-guided structured reasoning for VLMs. The findings highlight the limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.

📝 Abstract

Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
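The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` structure, the `classify_with_tree` function, the toy tree over four CIFAR-10 classes, and the `scripted_chooser` stand-in for a VLM call are all hypothetical names introduced here. In a real setup, the `choose` callable would prompt a VLM with the image, the node's question, and the allowed answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Node:
    """Internal decision node: a question whose answer selects a child."""
    question: str
    children: Dict[str, Union["Node", str]]  # a str child is a leaf class label


# Hypothetical interface: (image, question, allowed answers) -> chosen answer.
Chooser = Callable[[object, str, List[str]], str]


def classify_with_tree(image: object, root: Node, choose: Chooser) -> Tuple[str, List[Tuple[str, str]]]:
    """Walk the tree from the root, asking one question per level.

    Returns the predicted label and the (question, answer) path,
    which serves as the interpretable trace of the decision.
    """
    node: Union[Node, str] = root
    path: List[Tuple[str, str]] = []
    while isinstance(node, Node):
        options = list(node.children)
        answer = choose(image, node.question, options)
        if answer not in node.children:
            raise ValueError(f"answer {answer!r} is not one of {options}")
        path.append((node.question, answer))
        node = node.children[answer]
    return node, path


# Toy tree over four CIFAR-10 classes (illustrative only, not from the paper).
TOY_TREE = Node(
    "Is the object an animal or a vehicle?",
    {
        "animal": Node("Does the animal have wings?", {"yes": "bird", "no": "cat"}),
        "vehicle": Node("Does the vehicle travel on water?", {"yes": "ship", "no": "truck"}),
    },
)


def scripted_chooser(answers: Dict[str, str]) -> Chooser:
    """Stand-in for a VLM: looks up a fixed answer for each question."""
    def choose(image: object, question: str, options: List[str]) -> str:
        return answers[question]
    return choose


if __name__ == "__main__":
    chooser = scripted_chooser({
        "Is the object an animal or a vehicle?": "vehicle",
        "Does the vehicle travel on water?": "yes",
    })
    label, path = classify_with_tree(None, TOY_TREE, chooser)
    print(label)  # ship
    for question, answer in path:
        print(f"{question} -> {answer}")
```

One consequence of this design is visible in the sketch: a wrong answer at any level sends the traversal down the wrong subtree, so errors compound along the path. That is one plausible reason, offered here as an assumption rather than the paper's stated explanation, why tree-based prompting can trail a single zero-shot query.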

Problem

Research questions and friction points this paper is trying to address.

Assessing tree-based reasoning in vision language models

Evaluating structured reasoning on fine-grained visual classification

Investigating interpretable decision decomposition for VLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based reasoning framework for VLMs

LLM-generated classes and descriptions

Evaluation on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets
