Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

📅 2026-05-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses compositional personalization of vision-language models: recognizing or describing several user-defined concepts together at test time, even though no training data shows those concepts co-occurring. The authors propose Gate-and-Merge, a zero-shot framework that learns each concept independently as a lightweight LoRA adapter paired with a dedicated concept token, leaving the base model unchanged. At inference, it merges the concept-specific LoRA updates in weight space and uses a gating mechanism, driven by textual and visual cues, to select the relevant modules and suppress interference. Composition is further stabilized by combining only the most meaningful and mutually consistent updates, which helps preserve each concept's identity. Quantitative and qualitative analyses show consistent gains across multiple personalization tasks in both single-concept and compositional settings.
📝 Abstract
This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
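The inference step described in the abstract (gate the concepts, then merge the selected LoRA updates in weight space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity gate, the fixed threshold, and the toy matrices are all assumptions made for illustration, and the paper's consistency-aware filtering of updates is not modeled here.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, used to expand a LoRA pair into a dense update B @ A."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_concepts(query: list[float], keys: dict[str, list[float]],
                    threshold: float = 0.5) -> list[str]:
    """Hypothetical gate: keep concepts whose key vector matches the query cue.

    Assumption: relevance is scored by cosine similarity against a fixed threshold.
    """
    return sorted(name for name, key in keys.items()
                  if cosine(query, key) >= threshold)


def merge_selected(base: Matrix, adapters: dict[str, tuple[Matrix, Matrix]],
                   selected: list[str]) -> Matrix:
    """Return base + sum of B @ A over the selected concepts; base is not mutated."""
    merged = [row[:] for row in base]
    for name in selected:
        b, a = adapters[name]
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged


# Toy example: three independently learned rank-1 adapters on a 2x2 weight.
base = [[0.0, 0.0], [0.0, 0.0]]
adapters = {
    "dog": ([[1.0], [0.0]], [[1.0, 0.0]]),
    "mug": ([[0.0], [1.0]], [[0.0, 1.0]]),
    "cat": ([[1.0], [1.0]], [[1.0, 1.0]]),
}
keys = {"dog": [1.0, 0.0, 0.0], "mug": [0.0, 1.0, 0.0], "cat": [0.0, 0.0, 1.0]}
query = [1.0, 1.0, 0.0]  # stand-in for combined textual and visual cues

selected = select_concepts(query, keys)
merged = merge_selected(base, adapters, selected)
print(selected)  # ['dog', 'mug']
print(merged)    # [[1.0, 0.0], [0.0, 1.0]]
```

In this toy setup the "cat" adapter is gated out because its key is orthogonal to the query, so its update never touches the merged weight, while the base matrix itself is left unchanged.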
Problem

Research questions and friction points this paper is trying to address.

compositional personalization
vision-language models
zero-shot
concept composition
personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional personalization
zero-shot learning
LoRA merging
gating mechanism
vision-language models