MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification

📅 2025-03-10
🤖 AI Summary
Existing zero-shot learning (ZSL) methods leveraging encyclopedia documents as auxiliary knowledge struggle to suppress non-visual noise and often misalign non-visual words with image regions, degrading semantic transfer. To address this, we propose a multi-attribute supervised framework for document-based ZSL. Our approach introduces, for the first time, an LLM-driven document denoising and multi-perspective enhancement prompting algorithm to extract vision-relevant semantics; designs a model-agnostic focusing loss to explicitly enforce discriminative region–word alignment; and integrates local–global semantic interaction modeling with knowledge disentanglement. Evaluated on three document-based ZSL and generalized ZSL benchmarks, our method achieves average improvements of 7.2% and 8.2% over state-of-the-art methods, respectively, while enabling multi-attribute interpretable predictions at comparable computational cost.

📝 Abstract
Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes via knowledge transfer through shared auxiliary information. Recent studies show that documents from encyclopedias provide helpful auxiliary information. However, existing methods align noisy documents, which entangle visual and non-visual descriptions, with image regions, relying solely on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words to image regions, which harms knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework that removes noise at both the document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches sparsely described documents across multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge with information decoupling and semantic interactions for semantic alignment at local and global levels. In addition, we introduce a model-agnostic focus loss that explicitly enhances attention to visually discriminative information during training, which also improves existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average across three benchmarks in document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.
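The denoising and enrichment step described above can be sketched as a prompt-construction function. The template below is purely illustrative: the paper's actual prompt wording, attribute views, and LLM interface are not given in this summary, so every instruction and parameter name here is an assumption.

```python
def build_denoise_prompt(class_name: str, document: str,
                         attribute_views=("shape", "color", "texture", "habitat")) -> str:
    """Build a hypothetical denoising/enrichment prompt for an LLM.

    The two operations mirror the summary: (1) remove non-visual
    descriptions, (2) enrich sparsely described attribute views.
    The exact wording and view names are illustrative assumptions.
    """
    views = ", ".join(attribute_views)
    return (
        f"You are given an encyclopedia document about the class '{class_name}'.\n"
        "1. Remove every sentence that does not describe visual appearance\n"
        "   (e.g. taxonomy, etymology, population statistics).\n"
        f"2. Regroup the remaining visual descriptions into these attribute views: {views}.\n"
        "3. If a view is sparsely described, enrich it with commonly known visual facts.\n\n"
        f"Document:\n{document}\n"
    )
```

The resulting string would then be sent to an LLM of choice; the summary does not specify which model is used.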
Problem

Research questions and friction points this paper is trying to address.

Existing document-based ZSL methods fail to reliably filter non-visual noise from encyclopedia documents.
Non-visual words are incorrectly aligned with image regions, harming semantic knowledge transfer.
Models lack an explicit mechanism to focus attention on visually discriminative information.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-attribute document supervision framework
Prompt algorithm for noise removal
Model-agnostic focus loss enhancement
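The model-agnostic focus loss can be sketched as follows. The paper's exact formulation is not reproduced in this summary, so this is a minimal NumPy sketch under assumed inputs: a word-region similarity matrix and a binary mask marking visually discriminative words. The idea, hedged accordingly, is to sharpen region attention for discriminative words while discouraging confident alignment for non-visual ones.

```python
import numpy as np

def focus_loss(similarity, discriminative_mask, tau=0.1):
    """Hypothetical sketch of a focus loss (not the paper's exact formula).

    similarity: (W, R) array of word-region similarity scores.
    discriminative_mask: (W,) array, 1 for visually discriminative words, 0 otherwise.
    tau: softmax temperature (assumed hyperparameter).
    """
    # Softmax over regions for each word (numerically stabilized).
    z = similarity / tau
    z = z - z.max(axis=1, keepdims=True)
    attn = np.exp(z)
    attn /= attn.sum(axis=1, keepdims=True)
    # Sharpness term: small when a discriminative word attends to one region.
    sharp = -np.log(attn.max(axis=1) + 1e-8)
    # Entropy term: large when a non-visual word's attention stays diffuse.
    entropy = -(attn * np.log(attn + 1e-8)).sum(axis=1)
    m = discriminative_mask.astype(float)
    # Penalize diffuse attention for discriminative words,
    # reward diffuse attention for non-visual words.
    return float((m * sharp).mean() - ((1.0 - m) * entropy).mean())
```

Being a plain function of attention scores, such a loss could be added to an existing region-word alignment model without new parameters, which matches the "model-agnostic" claim in the summary.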
👥 Authors
Xiangyan Qu
IIE
Jing Yu
Northwestern University
Jiamin Zhuang
Institute of Information Engineering, Chinese Academy of Sciences, China, and the School of Cyber Security, University of Chinese Academy of Sciences, China
Gaopeng Gou
Institute of Information Engineering, Chinese Academy of Sciences, China, and the School of Cyber Security, University of Chinese Academy of Sciences, China
Gang Xiong
Institute of Information Engineering, Chinese Academy of Sciences, China, and the School of Cyber Security, University of Chinese Academy of Sciences, China
Qi Wu
Australian Centre for Robotic Vision (ACRV), the University of Adelaide