🤖 AI Summary
This work addresses the difficulty of separating artistic style from semantic content, which are deeply entangled in image representations. We propose StyleDecoupler, an information-theoretic framework built on one observation: multimodal vision-language models encode both style and content, whereas unimodal vision models suppress style in favor of content-invariant features. Using the unimodal representations as content-only references, the method isolates style features from the multimodal embeddings by minimizing mutual information. It operates as a plug-and-play module on frozen vision-language models and requires no fine-tuning. We also introduce WeART, a large-scale benchmark of 280K artworks spanning 152 styles and 1,556 artists. StyleDecoupler achieves state-of-the-art style retrieval on WeART and WikiArt, and it supports applications such as style relationship mapping and generative model evaluation.
📝 Abstract
Representing artistic style is challenging due to its deep entanglement with semantic content. We propose StyleDecoupler, an information-theoretic framework that leverages a key insight: multi-modal vision models encode both style and content, while uni-modal models suppress style to focus on content-invariant features. By using uni-modal representations as content-only references, we isolate pure style features from multi-modal embeddings through mutual information minimization. StyleDecoupler operates as a plug-and-play module on frozen Vision-Language Models without fine-tuning. We also introduce WeART, a large-scale benchmark of 280K artworks across 152 styles and 1,556 artists. Experiments show state-of-the-art performance on style retrieval across WeART and WikiArt, while enabling applications like style relationship mapping and generative model evaluation. We release our method and dataset at this url.
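The idea of using uni-modal features as a content-only reference can be illustrated with a toy sketch. This is not the StyleDecoupler objective, which the abstract does not spell out. It swaps in plain linear residualization: the part of the multi-modal embedding that is linearly predictable from the content embedding is subtracted, leaving a residual that is uncorrelated with content (for jointly Gaussian features, uncorrelated implies zero mutual information). The function name `residualize_style`, the synthetic data, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def residualize_style(multimodal, content):
    """Return the part of `multimodal` not linearly predictable from `content`.

    Hypothetical helper: a linear stand-in for mutual information
    minimization, not the method described in the abstract.
    """
    m = multimodal - multimodal.mean(axis=0, keepdims=True)
    c = content - content.mean(axis=0, keepdims=True)
    # Least-squares map from content features to multi-modal features.
    weights, *_ = np.linalg.lstsq(c, m, rcond=None)
    # The residual is orthogonal to every (centered) content dimension.
    return m - c @ weights


# Synthetic stand-ins for real embeddings (assumed shapes, not real models).
rng = np.random.default_rng(0)
n = 500
content = rng.normal(size=(n, 8))        # "uni-modal" content-only features
style = rng.normal(size=(n, 4))          # latent style we hope to recover
mixing = rng.normal(size=(8, 4))
multimodal = style + content @ mixing    # "multi-modal": style plus content

style_hat = residualize_style(multimodal, content)

# Cross-covariance between content and the recovered style residual.
cross = (content - content.mean(axis=0)).T @ style_hat / n
```

In this toy setting `cross` is zero up to floating-point error, and `style_hat` closely tracks the latent `style`. Real embeddings are entangled nonlinearly, which is presumably why the paper relies on a mutual information objective rather than a linear projection.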