🤖 AI Summary
This work investigates whether vision models can generalize attribute knowledge (e.g., “has four legs”) across semantically and perceptually distant categories (e.g., “dog” vs. “chair”). To address a limitation of existing attribute prediction benchmarks—whose implicit category correlations inflate estimates of cross-category generalization—the authors propose novel train-test split strategies based on LLM-driven semantic grouping, embedding similarity thresholding, and hierarchical clustering. This enables the first systematic evaluation of attribute robustness across unrelated superclasses. Experiments show that model performance degrades significantly as the semantic decoupling between training and test sets increases; the clustering-based split achieves the best trade-off between eliminating spurious correlations and preserving learnability. The study exposes critical fragility in current attribute prediction methods and establishes a more rigorous, generalizable benchmark for evaluating compositional and zero-shot attribute reasoning.
📝 Abstract
Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation of the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types, e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
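To make the clustering-based split concrete, the sketch below shows one plausible way to implement such a partition: cluster category embeddings (single-linkage agglomerative clustering on cosine distance), then assign whole clusters to either the train or the test side so that no cluster straddles the split. This is an illustrative reconstruction, not the paper's actual code; the toy 2-D "embeddings", the function names, and the balancing heuristic are all assumptions made for the example.

```python
# Illustrative sketch (NOT the paper's implementation): an embedding-based
# clustering train/test split that keeps whole clusters on one side,
# reducing hidden semantic correlation between train and test categories.
import math

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def agglomerative(embeddings, k):
    """Single-linkage agglomerative clustering of category names down to k clusters."""
    clusters = [[name] for name in embeddings]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_dist(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

def cluster_split(embeddings, k):
    """Assign whole clusters greedily to the smaller side of the split."""
    clusters = agglomerative(embeddings, k)
    clusters.sort(key=len, reverse=True)
    train, test = [], []
    for c in clusters:
        (train if len(train) <= len(test) else test).extend(c)
    return train, test

# Toy 2-D embeddings (hypothetical values chosen for illustration):
# animals and furniture occupy distinct regions of the space.
emb = {
    "dog": (0.9, 0.1), "cat": (0.85, 0.2),
    "chair": (0.1, 0.9), "table": (0.2, 0.85),
}
train, test = cluster_split(emb, k=2)
# With these toy vectors, "dog"/"cat" end up on one side and
# "chair"/"table" on the other, so the test categories share no
# cluster with the training categories.
```

A real pipeline would replace the toy vectors with CLIP-style category embeddings and a library clustering routine (e.g., scikit-learn's `AgglomerativeClustering`); the point of the greedy whole-cluster assignment is that semantic neighbors can never leak across the split.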