Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the dual challenges of scarce user-provided positive samples and low-quality retrieved negative samples in Vision-Language Model (VLM) personalization, this paper proposes Concept-as-Tree (CaT), a controllable synthetic data framework. CaT represents a concept as a tree structure, enabling controlled generation of positive and negative samples with graded difficulty and high semantic diversity. Combined with difficulty-aware sampling and semantic consistency filtering, it forms an end-to-end synthetic data pipeline that improves generalization without requiring real negative samples. To the authors' knowledge, this is the first controllable synthetic data method designed specifically for VLM personalization. Experiments on the MyVLM, Yo'LLaVA, and MC-LLaVA datasets demonstrate significant improvements in personalization performance and robustness to data sparsity and noise. The implementation is publicly available.
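The summary above describes representing a concept as a tree and sampling negatives of graded difficulty. The paper's actual construction is not detailed on this page, but a minimal sketch of the idea might look like the following: the root is a broad category, the target concept is one branch, and negatives drawn from sibling branches (sharing the category) are "harder" than negatives from unrelated trees. All names here (`Node`, `positives`, `negatives`, the toy concept "my-dog-bob") are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of a concept tree; leaves serve as generation prompts."""
    name: str
    children: list["Node"] = field(default_factory=list)

def add_child(parent: Node, name: str) -> Node:
    child = Node(name)
    parent.children.append(child)
    return child

def leaves(node: Node) -> list[str]:
    """Collect leaf prompts under a node."""
    if not node.children:
        return [node.name]
    out: list[str] = []
    for c in node.children:
        out.extend(leaves(c))
    return out

# Toy tree for a hypothetical personal concept "my-dog-bob".
root = Node("dog")                      # shared category
target = add_child(root, "my-dog-bob")  # the user's concept
add_child(target, "bob wearing a red collar")
add_child(target, "bob playing in the park")
other = add_child(root, "other-dogs")   # sibling branch -> hard negatives
add_child(other, "a golden retriever")

def positives(concept: Node) -> list[str]:
    # Positive prompts: leaves under the target concept node.
    return leaves(concept)

def negatives(category_root: Node, concept: Node) -> list[str]:
    # Hard negatives: leaves under siblings of the concept, i.e. samples
    # that share the category but are not the target. Easy negatives
    # would instead come from an unrelated category tree.
    neg: list[str] = []
    for child in category_root.children:
        if child is not concept:
            neg.extend(leaves(child))
    return neg
```

In this sketch, difficulty is controlled purely by tree distance: the more ancestors a negative shares with the target, the harder it is to distinguish from a true positive.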

📝 Abstract
Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at https://github.com/zengkaiya/CaT.
Problem

Research questions and friction points this paper is trying to address.

Addresses the scarcity of user-provided positive samples for VLM personalization.
Improves the quality of negative samples used for VLM fine-tuning.
Introduces Concept-as-Tree to generate samples of varying difficulty and diversity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-as-Tree (CaT) represents a user concept as a tree structure.
Generates diverse positive and negative synthetic samples with controllable difficulty.
A data filtering strategy ensures the quality of generated samples.