AI Summary
Existing vision-language understanding methods rely on global similarity matching, which struggles to model fine-grained category distinctions and attribute diversity prevalent in e-commerce scenarios. To address this, we propose a detection-guided generative unified framework that, for the first time, integrates object detection, hierarchical classification, and attribute recognition into a single end-to-end sequence generation task. Specifically, region-of-interest (ROI) features from detected bounding boxes serve as input to a BART-based generator, which autoregressively produces a sequence comprising a coarse-to-fine category path followed by attribute-value pairs, enabling attribute-conditioned recognition and fine-grained semantic modeling. Extensive experiments on large-scale e-commerce and public benchmarks demonstrate that our approach significantly outperforms conventional multi-stage classification and similarity-matching methods, achieving state-of-the-art performance in both fine-grained recognition accuracy and inference consistency.
Abstract
Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.
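To make the output format concrete, the following is a minimal sketch of how a coarse-to-fine category path and its property-value pairs might be flattened into a single target sequence for a text generator, and parsed back after decoding. The abstract does not specify the token layout, so the separators (`>`, `|`, `:`), the function names `serialize` and `parse`, and the example labels are all illustrative assumptions rather than the authors' actual scheme.

```python
from typing import Dict, List, Tuple

# Hypothetical separators; the paper's real token vocabulary is not given.
CAT_SEP = " > "   # between category levels, coarse to fine
ATTR_SEP = " | "  # between the category path and each property-value pair
KV_SEP = ": "     # between a property name and its value


def serialize(category_path: List[str], attributes: Dict[str, str]) -> str:
    """Flatten a category path and attributes into one target string."""
    parts = [CAT_SEP.join(category_path)]
    parts += [f"{prop}{KV_SEP}{value}" for prop, value in attributes.items()]
    return ATTR_SEP.join(parts)


def parse(sequence: str) -> Tuple[List[str], Dict[str, str]]:
    """Recover the category path and attributes from a generated string."""
    head, *rest = sequence.split(ATTR_SEP)
    category_path = head.split(CAT_SEP)
    attributes = dict(item.split(KV_SEP, 1) for item in rest)
    return category_path, attributes


if __name__ == "__main__":
    target = serialize(
        ["apparel", "dresses", "maxi dress"],
        {"color": "red", "sleeve": "short"},
    )
    print(target)
    print(parse(target))
```

Because the category path comes first in the sequence, an autoregressive decoder conditions every attribute prediction on the categories it has already emitted. Under the same assumed layout, property-conditioned recognition could be approximated by forcing the decoder prefix to end with a property name and letting the model generate only the value.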