Leveraging Vision-Language Models as Weak Annotators in Active Learning

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost of manual annotation in fine-grained image recognition by proposing an active learning framework that uses a vision-language model (VLM) as a weak annotator. The VLM supplies coarse-grained labels, which are combined with a limited budget of human-provided fine-grained annotations through instance-wise label assignment. A small set of trusted, fully labeled samples is used to model the systematic noise in the VLM's labels. Experiments on the CUB-200 and FGVC-Aircraft benchmarks show that, under the same annotation budget, the framework consistently outperforms existing active learning methods.

📝 Abstract

Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation within the active learning paradigm. We find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.
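
The abstract does not spell out how the noise in VLM-generated labels is modeled. As a rough illustration of the general idea only, the sketch below uses a standard noise transition matrix estimated from a small trusted set, then combines a classifier's fine-grained prediction with a noisy coarse VLM label via Bayes' rule. This is an assumption for illustration, not the paper's formulation: the function names, the `fine_to_coarse` mapping, the Laplace smoothing, and the toy data are all hypothetical.

```python
from collections import Counter


def estimate_transition(trusted_pairs, num_coarse, smoothing=1.0):
    """Estimate T[t][o] = P(VLM outputs coarse label o | true coarse label t).

    trusted_pairs: list of (true_coarse, vlm_coarse) tuples taken from a small
    trusted set (hypothetical data layout). Laplace smoothing keeps every
    entry strictly positive.
    """
    counts = Counter(trusted_pairs)
    T = []
    for t in range(num_coarse):
        row = [counts[(t, o)] + smoothing for o in range(num_coarse)]
        total = sum(row)
        T.append([v / total for v in row])
    return T


def posterior_over_fine(prior, fine_to_coarse, vlm_label, T):
    """Reweight a fine-class distribution by the likelihood of the VLM label.

    prior: classifier probabilities over fine classes for one image.
    fine_to_coarse: fine_to_coarse[k] is the coarse class of fine class k.
    """
    weights = [p * T[fine_to_coarse[k]][vlm_label] for k, p in enumerate(prior)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy example: 2 coarse classes, 4 fine classes (two per coarse class).
trusted = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]
T = estimate_transition(trusted, num_coarse=2)
fine_to_coarse = [0, 0, 1, 1]
prior = [0.25, 0.25, 0.25, 0.25]
post = posterior_over_fine(prior, fine_to_coarse, vlm_label=1, T=T)
```

In this toy setup the VLM's coarse label shifts probability mass toward the fine classes under coarse class 1 without ruling out the others, which is the behavior one would want from a weak label that is usually, but not always, correct.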

Problem

Research questions and friction points this paper is trying to address.

active learning

vision-language models

weak annotation

fine-grained recognition

annotation cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

active learning

weak supervision