UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

📅 2025-11-19

🤖 AI Summary

Existing vision-language understanding methods rely on global similarity matching, which struggles to model the fine-grained category distinctions and attribute diversity prevalent in e-commerce scenarios. To address this, we propose a unified, detection-guided generative framework that, for the first time, integrates object detection, hierarchical classification, and attribute recognition into a single end-to-end sequence generation task. Specifically, region-of-interest (ROI) features from detected bounding boxes serve as input to a BART-based generator, which autoregressively produces a sequence comprising a coarse-to-fine category path followed by attribute-value pairs. This design enables attribute-conditioned recognition and fine-grained semantic modeling. Extensive experiments on large-scale e-commerce and public benchmarks demonstrate that our approach significantly outperforms conventional multi-stage classification and similarity-matching methods, achieving state-of-the-art performance in both fine-grained recognition accuracy and inference consistency.

📝 Abstract

Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.
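As an illustration of the coarse-to-fine output described above, the following minimal sketch serializes a category path and property-value pairs into one flat token sequence and parses it back. The special tokens (`<cat>`, `<prop>`, `<val>`, `<eos>`), the function names, and the example labels are all hypothetical; the abstract does not specify the actual vocabulary or sequence layout.

```python
from typing import List, Tuple

# Hypothetical special tokens: the paper's real vocabulary is not given here.
CAT, PROP, VAL, EOS = "<cat>", "<prop>", "<val>", "<eos>"


def serialize(category_path: List[str],
              attributes: List[Tuple[str, str]]) -> List[str]:
    """Flatten a coarse-to-fine category path, then property-value pairs."""
    tokens: List[str] = []
    for level in category_path:          # coarse level first, finest last
        tokens += [CAT, level]
    for prop, value in attributes:       # attributes follow the category path
        tokens += [PROP, prop, VAL, value]
    tokens.append(EOS)
    return tokens


def parse(tokens: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Recover the category path and attribute pairs from a token sequence."""
    path: List[str] = []
    attrs: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens) and tokens[i] != EOS:
        if tokens[i] == CAT and i + 1 < len(tokens):
            path.append(tokens[i + 1])
            i += 2
        elif tokens[i] == PROP and i + 3 < len(tokens) and tokens[i + 2] == VAL:
            attrs.append((tokens[i + 1], tokens[i + 3]))
            i += 4
        else:
            break                        # stop at the first malformed chunk
    return path, attrs


seq = serialize(["Apparel", "Tops", "T-Shirt"],
                [("color", "red"), ("sleeve", "short")])
print(seq)
print(parse(seq))
```

Placing the category path before the attributes means that, under autoregressive decoding, each attribute token is generated with the full category path already in context, which is one plausible reading of how category-specific attributes could be modeled.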

Problem

Research questions and friction points this paper is trying to address.

Handling object detection, category prediction, and attribute recognition simultaneously within one unified framework

Capturing fine-grained category distinctions and category-specific attribute diversity, which global-similarity approaches struggle with

Achieving visual semantic understanding in large-scale e-commerce scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Detection-guided generative framework predicts hierarchical tokens

BART-based generator produces coarse-to-fine semantic sequences

Refined ROI-level features per detected object feed the generator, which supports property-conditioned attribute recognition
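One way to picture property-conditioned attribute recognition is as prefix-conditioned decoding: the decoder is given a prefix that ends with the queried property, and it only has to generate the value. The sketch below assumes this reading, which the summary does not confirm. The special tokens, `conditioned_prefix`, `decode_value`, and the lookup-table `toy_next_token` (a stand-in for a real BART-style decoder step over ROI features) are all hypothetical.

```python
from typing import Callable, List

# Hypothetical special tokens: not taken from the paper.
CAT, PROP, VAL, EOS = "<cat>", "<prop>", "<val>", "<eos>"


def conditioned_prefix(category_path: List[str], prop: str) -> List[str]:
    """Build a decoder prefix that ends right before the value to predict."""
    prefix: List[str] = []
    for level in category_path:
        prefix += [CAT, level]
    return prefix + [PROP, prop, VAL]


def decode_value(prefix: List[str],
                 next_token: Callable[[List[str]], str],
                 max_steps: int = 8) -> List[str]:
    """Greedily extend the prefix until the value ends (EOS or next PROP)."""
    seq = list(prefix)
    value: List[str] = []
    for _ in range(max_steps):
        tok = next_token(seq)
        if tok in (EOS, PROP):
            break
        value.append(tok)
        seq.append(tok)
    return value


def toy_next_token(seq: List[str]) -> str:
    """Toy stand-in for a trained generator: a fixed property-to-value table."""
    table = {"color": "red", "sleeve": "short"}
    if seq[-1] == VAL:                   # value requested for property seq[-2]
        return table.get(seq[-2], EOS)
    return EOS                           # one-token values in this toy setup


prefix = conditioned_prefix(["Apparel", "Tops"], "color")
print(decode_value(prefix, toy_next_token))
```

In a real system the `next_token` callable would wrap the generator's decoding step and would also depend on the ROI features of the detected object; the toy version ignores visual input entirely.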
