🤖 AI Summary
To address the dual challenges of scarce user-provided positive samples and low-quality retrieved negative samples in Vision-Language Model (VLM) personalization, this paper proposes Concept-as-Tree (CaT), a controllable synthetic data framework. CaT represents each vision-language concept as a tree, enabling controlled generation of positive and negative samples with graded difficulty and high semantic diversity. Combined with difficulty-aware sampling and semantic consistency filtering, it forms an end-to-end synthetic data pipeline that enhances generalization without requiring real negative samples. To the authors' knowledge, this is the first controllable synthetic data method specifically designed for VLM personalization. Experiments on MyVLM, Yo'LLaVA, and MC-LLaVA demonstrate significant improvements in personalized performance and robustness against data sparsity and noise. The implementation is publicly available.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods fine-tune these models on positive and negative samples. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples make fine-tuning difficult. To reveal the relationship between samples and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework ensures the quality of the generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline in alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at https://github.com/zengkaiya/CaT.
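As a rough illustration of how a tree over concepts could grade negative samples by difficulty, the sketch below ranks candidate negatives by how much of their root-to-leaf path they share with the target concept. This is only one plausible reading and not the released CaT implementation: the `ConceptNode` class, the example tree, and the shared-path score are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConceptNode:
    """A node in a toy concept tree (hypothetical structure, not the paper's)."""
    name: str
    children: List["ConceptNode"] = field(default_factory=list)


def path_to(root: ConceptNode, target: str,
            prefix: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Return the names from the root down to `target`, or None if absent."""
    path = prefix + (root.name,)
    if root.name == target:
        return path
    for child in root.children:
        found = path_to(child, target, path)
        if found is not None:
            return found
    return None


def leaves(root: ConceptNode) -> List[str]:
    """Collect leaf names in left-to-right order."""
    if not root.children:
        return [root.name]
    names: List[str] = []
    for child in root.children:
        names.extend(leaves(child))
    return names


def negative_difficulty(root: ConceptNode, concept: str, other: str) -> int:
    """Assumed difficulty score: number of shared ancestors on the two paths.

    A larger value means the negative sits closer to the concept in the tree
    and is therefore treated as a harder negative.
    """
    a, b = path_to(root, concept), path_to(root, other)
    if a is None or b is None:
        raise ValueError("both names must exist in the tree")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


def rank_negatives(root: ConceptNode, concept: str) -> List[str]:
    """List all other leaves, hardest negative first."""
    others = [name for name in leaves(root) if name != concept]
    return sorted(others,
                  key=lambda name: negative_difficulty(root, concept, name),
                  reverse=True)


# Hypothetical example tree; the categories are illustrative only.
EXAMPLE_TREE = ConceptNode("object", [
    ConceptNode("animal", [
        ConceptNode("dog", [ConceptNode("corgi"), ConceptNode("husky")]),
        ConceptNode("cat", [ConceptNode("siamese")]),
    ]),
    ConceptNode("vehicle", [ConceptNode("bicycle")]),
])

if __name__ == "__main__":
    print(rank_negatives(EXAMPLE_TREE, "corgi"))
```

In this toy tree, a sibling breed shares the most ancestors with the target and is ranked as the hardest negative, while a concept from a different top-level branch is ranked as the easiest. A real pipeline would additionally need a generator to synthesize images for each selected node and a filter to discard low-quality outputs, neither of which is modeled here.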