UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

2025-11-19
Citations: 0
Influential: 0
AI Summary
Existing vision-language understanding methods rely on global similarity matching, which struggles to model fine-grained category distinctions and attribute diversity prevalent in e-commerce scenarios. To address this, we propose a detection-guided generative unified framework that, for the first time, integrates object detection, hierarchical classification, and attribute recognition into a single end-to-end sequence generation task. Specifically, region-of-interest (ROI) features from detected bounding boxes serve as input to a BART-based generator, which autoregressively produces a sequence comprising a coarse-to-fine category path followed by attribute-value pairs, enabling attribute-conditioned recognition and fine-grained semantic modeling. Extensive experiments on large-scale e-commerce and public benchmarks demonstrate that our approach significantly outperforms conventional multi-stage classification and similarity-matching methods, achieving state-of-the-art performance in both fine-grained recognition accuracy and inference consistency.

Abstract
Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.
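As a rough illustration of the output format described above, the following Python sketch linearizes a coarse-to-fine category path and a set of property-value pairs into one token sequence, then parses it back. The separator tokens (`<cat>`, `<attr>`, `<val>`, `</s>`), the function names, and the example labels are assumptions made for illustration; the paper's actual vocabulary and serialization are not stated here.

```python
from typing import Dict, List, Tuple

# Hypothetical special tokens (assumed; not taken from the paper).
CAT_SEP = "<cat>"
ATTR_SEP = "<attr>"
VAL_SEP = "<val>"
EOS = "</s>"


def linearize(category_path: List[str], attributes: Dict[str, str]) -> List[str]:
    """Build a target sequence: category levels first, then property-value pairs."""
    tokens: List[str] = []
    for level in category_path:  # coarse-to-fine order is preserved
        tokens += [CAT_SEP, level]
    for prop, value in attributes.items():
        tokens += [ATTR_SEP, prop, VAL_SEP, value]
    tokens.append(EOS)
    return tokens


def parse(tokens: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Recover the category path and attributes from a generated sequence."""
    path: List[str] = []
    attrs: Dict[str, str] = {}
    i = 0
    while i < len(tokens) and tokens[i] != EOS:
        if tokens[i] == CAT_SEP and i + 1 < len(tokens):
            path.append(tokens[i + 1])
            i += 2
        elif (tokens[i] == ATTR_SEP and i + 3 < len(tokens)
              and tokens[i + 2] == VAL_SEP):
            attrs[tokens[i + 1]] = tokens[i + 3]
            i += 4
        else:
            break  # malformed output: keep what was parsed so far
    return path, attrs


if __name__ == "__main__":
    seq = linearize(["Apparel", "Tops", "T-shirt"], {"color": "red"})
    print(seq)
    print(parse(seq))
```

Emitting the category path before the attributes means every attribute token is generated with the predicted category already in the decoder context, which is one plausible reading of how a single sequence can tie the three tasks together.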
Problem

Research questions and friction points this paper is trying to address.

Unified framework handles object detection, category prediction, and attribute recognition simultaneously
Overcomes limitations in capturing fine-grained category distinctions and attribute diversity
Addresses visual semantic understanding challenges in large-scale e-commerce scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detection-guided generative framework predicts hierarchical tokens
BART-based generator produces coarse-to-fine semantic sequences
ROI-level features enable property-conditioned attribute recognition
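One way to picture property-conditioned attribute recognition is as prefix-conditioned generation: the decoder is given the category path plus the queried property name and asked only for the value. The sketch below assumes that mechanism, along with the special tokens and the `toy_next_token` stand-in for a trained generator; none of these are confirmed by the source, which does not say how the conditioning is implemented.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical special tokens (assumed; not taken from the paper).
CAT_SEP, ATTR_SEP, VAL_SEP = "<cat>", "<attr>", "<val>"


def build_prefix(category_path: List[str], prop: str) -> List[str]:
    """Decoder prefix ending right before the value slot of the queried property."""
    prefix: List[str] = []
    for level in category_path:
        prefix += [CAT_SEP, level]
    prefix += [ATTR_SEP, prop, VAL_SEP]
    return prefix


def predict_value(prefix: List[str],
                  next_token: Callable[[List[str]], str]) -> str:
    """Ask the generator for the single token that follows the prefix."""
    return next_token(prefix)


# Toy stand-in for a trained generator: a lookup keyed by (leaf category, property).
TOY_VALUES: Dict[Tuple[str, str], str] = {
    ("T-shirt", "color"): "red",
    ("T-shirt", "sleeve"): "short",
}


def toy_next_token(prefix: List[str]) -> str:
    # The leaf category is the last token that follows a <cat> separator.
    leaf = [prefix[i + 1] for i, t in enumerate(prefix) if t == CAT_SEP][-1]
    prop = prefix[prefix.index(ATTR_SEP) + 1]
    return TOY_VALUES.get((leaf, prop), "<unk>")


if __name__ == "__main__":
    prefix = build_prefix(["Apparel", "Tops", "T-shirt"], "sleeve")
    print(prefix)
    print(predict_value(prefix, toy_next_token))
```

In a real system the lookup would be replaced by a decoding step over ROI-conditioned encoder states; the point here is only that the property name sits in the prefix, so the same generator can answer different attribute queries for one detected object.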
Xinyu Nan
Kuaishou Technology, Beijing, China
Lingtao Mao
Kuaishou Technology, Beijing, China
Huangyu Dai
Kuaishou Technology, Beijing, China
Zexin Zheng
Kuaishou Technology, Beijing, China
Xinyu Sun
Kuaishou Technology, Beijing, China
Zihan Liang
Kuaishou Technology, Beijing, China
Ben Chen
KuaiShou, Alibaba, HUST, WHU
Multimodal, LLM, Generative Recommendation, Semantic Matching
Yuqing Ding
Kuaishou Technology, Beijing, China
Chenyi Lei
Kuaishou Technology
Recommender System, Information Retrieval, Generative Recommendation, Multimodal
Wenwu Ou
Kuaishou Technology, Beijing, China
Han Li
Kuaishou Technology, Beijing, China