AI Summary
This work addresses a limitation of existing vision systems: they do not model the structured hierarchical dependencies among scenes, objects, parts, and affordances, which hinders interaction-oriented semantic understanding. It introduces, for the first time, a hierarchical scene parsing task that explicitly constructs a "scene -> object -> part -> affordance" hierarchy and proposes a unified generative framework based on vision-language models. Key contributions include the formal definition of this new task, the design of structure-completion pseudo-labels and a curriculum learning strategy, and the creation of SceneParser-Bench, a large-scale benchmark with tailored evaluation metrics. Experiments demonstrate that the proposed method outperforms current multimodal large language models and perception-stitching pipelines on SceneParser-Bench, while remaining compatible with conventional benchmarks such as COCO and AGD20K and showing practical utility in a downstream planning probe.
Abstract
General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. In addition, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
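As an illustration only, the scene -> object -> part -> affordance hierarchy described above can be pictured as nested records in which each level is bound to its parent. The sketch below is a minimal Python rendering under stated assumptions: the class names, the box format, the interaction-point field, and the `chains` helper are all hypothetical and are not taken from SceneParser's output format or from the SceneParser-Bench annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Affordance:
    label: str                   # e.g. "grasp"
    point: Tuple[float, float]   # assumed: an interaction point in image coordinates


@dataclass
class Part:
    label: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    label: str
    box: Box
    parts: List[Part] = field(default_factory=list)


@dataclass
class Scene:
    label: str
    objects: List[SceneObject] = field(default_factory=list)


def chains(scene: Scene) -> List[Tuple[str, str, str]]:
    """List every complete object-part-affordance chain in the scene.

    An object without parts, or a part without affordances, contributes
    no chain because the hierarchy below it is incomplete.
    """
    return [
        (obj.label, part.label, aff.label)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]


def example_scene() -> Scene:
    """Build a toy scene: a mug with a graspable handle and a table with no parts."""
    handle = Part("handle", (120.0, 80.0, 150.0, 140.0),
                  [Affordance("grasp", (135.0, 110.0))])
    body = Part("body", (60.0, 70.0, 125.0, 150.0))  # no affordance annotated
    mug = SceneObject("mug", (60.0, 70.0, 150.0, 150.0), [handle, body])
    table = SceneObject("table", (0.0, 140.0, 400.0, 300.0))  # no parts annotated
    return Scene("kitchen", [mug, table])
```

Calling `chains(example_scene())` lists only the mug's handle chain, since the mug's body has no affordance and the table has no parts. This mirrors the idea of counting valid object-part-affordance chain instances, but it does not reproduce the paper's Level-1 to Level-3 metrics or ParseRate, whose definitions are not given here.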