🤖 AI Summary
In vision-language prompt tuning, freezing the visual encoder often leads to feature misalignment and inter-class confusion. To address this, we propose a unified framework that balances task specialization with cross-domain generalization. Our method introduces two key components: (1) Confusion-Aware Loss (CoA-loss), which refines decision boundaries between confusing classes to improve specialization; and (2) Confidence-Aware weights (CoA-weights), which weight each prediction in a mixture model by its confidence within the class domains, with a mathematical argument that such a mixture enhances generalization without compromising specialization. Crucially, our approach requires no fine-tuning of the visual encoder, preserving its pre-trained semantics while enhancing adaptability. Extensive experiments demonstrate significant improvements in specialization and generalization, achieving state-of-the-art results on multiple benchmarks. The implementation is publicly available.
📝 Abstract
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptation. The core challenge in prompt tuning is improving both specialization for a specific task and generalization to unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weight of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.
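The confidence-weighted mixture idea can be illustrated with a minimal sketch. Note that the function names, the per-class weight matrix, and the softmax-based combination below are illustrative assumptions for exposition, not the paper's actual CoA-weights formulation:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over class logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_predict(logits_list, conf_weights):
    """Combine per-prompt predictions with per-class confidence weights.

    logits_list: list of (num_classes,) logit vectors, one per prompt.
    conf_weights: (num_prompts, num_classes) nonnegative weights modeling
        each prompt's confidence within each class domain (illustrative).
    """
    probs = np.stack([softmax(l) for l in logits_list])        # (M, C)
    w = conf_weights / conf_weights.sum(axis=0, keepdims=True)  # normalize per class
    mixed = (w * probs).sum(axis=0)                             # class-wise weighted mixture
    return mixed / mixed.sum()                                  # renormalize to a distribution

# Example: a task-specialized prompt and a generalist prompt over 3 classes.
specialized = np.array([2.0, 0.5, -1.0])
generalist = np.array([1.0, 1.0, 1.0])
# Weight the specialized prompt higher on classes it is confident about.
weights = np.array([[0.9, 0.9, 0.2],
                    [0.1, 0.1, 0.8]])
p = mixture_predict([specialized, generalist], weights)
```

The design intuition being sketched: within the task's class domains the specialized prediction dominates (preserving specialization), while elsewhere the mixture falls back toward the generalist prediction (preserving generalization).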