TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address distribution shift in compositional zero-shot learning (CZSL) caused by unseen attribute–object combinations at test time, this paper proposes a test-time multimodal collaborative optimization framework. Methodologically, it introduces an unsupervised cross-modal knowledge accumulation mechanism that dynamically fuses textual semantics and visual features during inference, leveraging adaptive weight updating, dynamic priority queue management, and prototype-space alignment to enable continual evolution and joint optimization of multimodal prototypes. Crucially, the approach requires no additional annotations and effectively exploits historical visual knowledge to enhance generalization. Evaluated under both closed-world and open-world settings across four benchmark datasets, the method achieves state-of-the-art performance, notably improving recognition accuracy for unseen compositions.

📝 Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of the label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome this challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, we introduce a dynamic priority queue that stores high-confidence images, allowing visual knowledge from historical images to inform inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes through multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT.
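The abstract's adaptive update weight can be pictured as a confidence-gated exponential moving average over prototypes. The paper does not specify its update rule, so the sketch below is a minimal illustration under assumed conventions: unit-normalized embedding vectors (as in CLIP-style models), a base momentum hyperparameter, and a per-sample confidence in [0, 1] that scales how far the prototype moves; the function name and weighting scheme are hypothetical.

```python
import math

def adaptive_prototype_update(prototype, feature, confidence, base_momentum=0.99):
    """Hypothetical sketch of a confidence-gated EMA prototype update.

    A confident prediction pulls the prototype toward the test-time
    feature; a confidence of zero leaves the prototype unchanged.
    """
    # Effective momentum grows back toward 1.0 as confidence drops,
    # so low-confidence samples barely move the prototype (assumption).
    m = base_momentum + (1.0 - base_momentum) * (1.0 - confidence)
    updated = [m * p + (1.0 - m) * f for p, f in zip(prototype, feature)]
    # Re-normalize to keep prototypes on the unit hypersphere, a common
    # convention for similarity-based classification.
    norm = math.sqrt(sum(x * x for x in updated))
    return [x / norm for x in updated]
```

With confidence 0 the prototype is returned unchanged; with confidence 1 the update weight reaches its maximum of `1 - base_momentum`, so one sample can shift the prototype only slightly, which keeps test-time adaptation stable.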
Problem

Research questions and friction points this paper is trying to address.

Addresses distribution shift in compositional zero-shot learning at test time
Updates multimodal prototypes using unsupervised textual and visual knowledge
Adapts flexibly to unseen attribute-object compositions during testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accumulates comprehensive knowledge from unsupervised multimodal data
Adaptively updates multimodal prototypes with controlled adjustment weights
Aligns textual and visual prototypes through collaborative representation learning
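The dynamic priority queue of high-confidence images described above could be realized as a fixed-capacity min-heap keyed by prediction confidence, so that the weakest stored sample is evicted first. The paper does not give its data structure, so the class below is an assumed sketch; the class name, capacity default, and eviction policy are all hypothetical.

```python
import heapq
from itertools import count

class ConfidenceQueue:
    """Hypothetical fixed-capacity store for the highest-confidence
    test samples seen so far, backed by a min-heap on confidence."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._heap = []            # entries: (confidence, tiebreak, feature)
        self._tiebreak = count()   # avoids comparing features on equal confidence

    def push(self, confidence, feature):
        item = (confidence, next(self._tiebreak), feature)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif confidence > self._heap[0][0]:
            # Queue full: replace the current lowest-confidence entry.
            heapq.heapreplace(self._heap, item)

    def features(self):
        """Stored features, e.g. for aggregating historical visual knowledge."""
        return [f for _, _, f in self._heap]
```

At inference time, the stored features could be averaged per composition to supply the historical visual knowledge the summary mentions; that aggregation step is omitted here.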