Compositional Zero-shot Learning via Progressive Language-based Observations

πŸ“… 2023-11-23
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 9
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses compositional zero-shot learning (CZSL): recognizing images that depict unseen state-object compositions (e.g., “old cat”) not observed during training. To mitigate the semantic ambiguity arising from object-conditioned and state-conditioned variance, the authors propose the Progressive Language-based Observations (PLO) framework, which dynamically determines a better order in which to observe the two primitives. In the PLO-VLM variant, a pre-observing classifier decides which primitive (state or object) a pre-trained vision-language model (VLM) should observe first in a two-step scheme; in the PLO-LLM variant, a large language model (LLM) crafts composition-specific prompts for multi-step, step-by-step observation. Extensive experiments on three challenging benchmarks show that PLO outperforms state-of-the-art methods, validating its effectiveness in compositional recognition.
πŸ“ Abstract
Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state “old” can signify a vintage design for a “car” or an advanced age for a “cat”. In this paper, we argue that these variances can be mitigated by predicting composition categories based on a pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
Problem

Research questions and friction points this paper is trying to address.

Recognizing unseen state-object compositions with known primitives
Modeling interactions between primitives for novel compositions
Mitigating appearance variance when states combine with different objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Language-based Observations (PLO) for a dynamic observation order of primitives
Leveraging pre-trained vision-language models to equip the model with observation capabilities
Utilizing LLMs to craft composition-specific prompts for step-by-step observation
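The two-step PLO-VLM idea described above can be sketched in a few lines. This is a minimal, illustrative mock-up, not the authors' implementation: `observe` stands in for a CLIP-style image-text similarity score, the prompt templates are hypothetical, and the pre-observing classifier is reduced to comparing two placeholder confidence scores.

```python
# Minimal sketch of PLO-VLM's two-step dynamic observation order.
# All scoring functions are illustrative placeholders, not the paper's model.

def pre_observe(state_score, object_score):
    """Pre-observing classifier (mocked): pick which primitive to look at first."""
    return "state" if state_score >= object_score else "object"

def observe(image_feat, prompt):
    """Stand-in for VLM image-text matching (hypothetical dummy score)."""
    return (sum(image_feat) * len(prompt)) % 7

def plo_vlm(image_feat, states, objects, state_score, object_score):
    """Observe one primitive first, then condition the second observation on it."""
    if pre_observe(state_score, object_score) == "state":
        state = max(states, key=lambda s: observe(image_feat, f"a photo of something {s}"))
        obj = max(objects, key=lambda o: observe(image_feat, f"a photo of a {state} {o}"))
    else:
        obj = max(objects, key=lambda o: observe(image_feat, f"a photo of a {o}"))
        state = max(states, key=lambda s: observe(image_feat, f"a photo of a {s} {obj}"))
    return state, obj
```

The key design point mirrored here is that the second prediction is conditioned on the first (the chosen state appears in the object prompt, or vice versa), which is how conditioning on a pre-observed primitive reduces object- and state-conditioned variance.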
Lin Li
Zhejiang University
Guikun Chen
Zhejiang University
Jun Xiao
Zhejiang University
Long Chen
The Hong Kong University of Science and Technology