🤖 AI Summary
This work addresses knowledge fusion across heterogeneous visual models, including generative models, multimodal large language models, graphics engines, and physics simulators. We propose a training-free, inference-time expert ensemble framework. Our core innovation is the first training-free knowledge composition paradigm based on product distributions, enabling dynamic expert collaboration across modalities and architectures. Coupled with Annealed Importance Sampling (AIS), the method efficiently samples from the product distribution of multiple experts while remaining compatible with arbitrary pre-trained visual and physical priors. The approach significantly improves generation controllability and target alignment, enabling fine-grained semantic control and interactive editing interfaces for image and video synthesis. Without any fine-tuning, it produces high-fidelity, editable outputs, overcoming the flexibility and precision limitations inherent in monolithic models.
📝 Abstract
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
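To make the core mechanism concrete, below is a minimal, self-contained sketch (not the paper's implementation) of Annealed Importance Sampling targeting a product of experts, using toy 1-D Gaussian experts in place of the paper's visual and physical priors. All names and parameters (`expert_a_logpdf`, `n_anneal`, `n_mh`, step sizes) are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "experts": each contributes a log-density over the same variable x.
# In the paper's setting these would be heterogeneous priors (generative
# models, vision-language models, simulators); here they are 1-D Gaussians.
def expert_a_logpdf(x):
    return -0.5 * ((x - 2.0) / 0.8) ** 2

def expert_b_logpdf(x):
    return -0.5 * ((x + 1.0) / 1.2) ** 2

def target_logpdf(x):
    # Product of experts: densities multiply, so log-densities add.
    return expert_a_logpdf(x) + expert_b_logpdf(x)

def base_logpdf(x):
    # Broad, easy-to-sample base distribution (annealing start point).
    return -0.5 * (x / 5.0) ** 2

def ais_sample(n_particles=2000, n_anneal=50, n_mh=5, step=0.5):
    """Annealed Importance Sampling from the base to the product distribution."""
    betas = np.linspace(0.0, 1.0, n_anneal + 1)
    x = rng.normal(0.0, 5.0, size=n_particles)   # exact draws from the base
    log_w = np.zeros(n_particles)                # accumulated log-weights

    def anneal_logpdf(x, beta):
        # Geometric bridge: (1 - beta) * base + beta * target, in log space.
        return (1.0 - beta) * base_logpdf(x) + beta * target_logpdf(x)

    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # Importance-weight update for moving to the next intermediate density.
        log_w += anneal_logpdf(x, b_next) - anneal_logpdf(x, b_prev)
        # A few Metropolis-Hastings steps targeting the new density.
        for _ in range(n_mh):
            prop = x + step * rng.normal(size=n_particles)
            log_acc = anneal_logpdf(prop, b_next) - anneal_logpdf(x, b_next)
            accept = np.log(rng.uniform(size=n_particles)) < log_acc
            x = np.where(accept, prop, x)

    return x, log_w

samples, log_w = ais_sample()
w = np.exp(log_w - log_w.max())
# Product of N(2, 0.8^2) and N(-1, 1.2^2) has precision-weighted mean ~1.08.
print(f"weighted posterior mean ~ {np.sum(w * samples) / np.sum(w):.3f}")
```

In the actual framework the experts are high-dimensional visual and physical models and the transitions operate in a generative model's sampling space; this toy only illustrates the anneal-and-reweight structure that lets AIS draw samples no single expert could produce alone.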