VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale labels automatically generated by multimodal foundation models (e.g., CLIP, LLaVA) lack ground-truth annotations, and existing evaluation methods rely on limited metrics or small-sample inspections, which hinders the detection of latent errors, especially in open-vocabulary image segmentation. The paper proposes the first human-in-the-loop visual analytics framework tailored for this task, integrating multimodal output analysis, visual clustering-based diagnosis, interactive label correction, and an expert feedback loop to enable fine-grained quality assessment and iterative refinement. By introducing visual analytics into the quality-assurance pipeline for auto-generated labels, the approach overcomes the limitations of purely quantitative or sampling-based validation. Evaluated on two benchmark datasets, it significantly improves downstream task performance while enabling efficient identification of systematic labeling errors and enhancing model generalization.

📝 Abstract
The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied as existing approaches focus more on data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA's effectiveness from both quantitative and qualitative perspectives.
Problem

Research questions and friction points this paper is trying to address.

No comprehensive methods exist for evaluating the quality of FM-generated labels
Validating large datasets without ground-truth annotations is challenging
Existing approaches prioritize data quantity over label quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual analytics for FM label validation
Multi-phased data validation strategies
Human-AI collaboration for label correction
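The clustering-based diagnosis idea above can be illustrated with a minimal sketch: flag items whose embedding lies unusually far from the centroid of their assigned label's cluster, so a human reviewer can inspect the most suspicious labels first. This is a simplified stand-in, not VISTA's actual algorithm; the function name, the z-score threshold, and the centroid-distance criterion are all assumptions for illustration.

```python
import numpy as np

def flag_suspect_labels(embeddings, labels, z_thresh=2.0):
    """Flag items whose embedding is unusually far from its label's centroid.

    A simple illustrative proxy for clustering-based label diagnosis:
    items far from their own class cluster are candidates for human review.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dists = np.empty(len(labels))
    for lab in np.unique(labels):
        idx = np.where(labels == lab)[0]
        centroid = embeddings[idx].mean(axis=0)          # per-label centroid
        dists[idx] = np.linalg.norm(embeddings[idx] - centroid, axis=1)
    mu, sigma = dists.mean(), dists.std()
    if sigma == 0:                                       # all items equidistant
        return np.zeros(len(labels), dtype=bool)
    return (dists - mu) / sigma > z_thresh               # z-score outliers

# Example: one "cat"-labeled point sitting inside the "dog" cluster is flagged.
emb = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [10, 10],
       [10, 10], [10.1, 10], [10, 10.1], [10.1, 10.1]]
labs = ["cat"] * 5 + ["dog"] * 4
flags = flag_suspect_labels(emb, labs)
```

In an interactive tool, the flagged items would be surfaced to the expert for inspection and correction, closing the feedback loop the paper describes.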