🤖 AI Summary
This work addresses knowledge fusion across heterogeneous visual models, including generative models, multimodal large language models, graphics engines, and physics simulators. We propose a training-free, inference-time expert ensemble framework. Our core innovation is the first training-free knowledge composition paradigm based on product distributions, enabling dynamic expert collaboration across modalities and architectures. Coupled with Annealed Importance Sampling (AIS), the method efficiently samples from the product distribution of multiple experts while remaining compatible with arbitrary pre-trained visual and physical priors. The approach significantly improves generation controllability and target alignment, enabling fine-grained semantic control and interactive editing interfaces for image and video synthesis. Without any fine-tuning, it produces high-fidelity, editable outputs, overcoming the flexibility and precision limitations inherent in monolithic models.
📝 Abstract
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
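To make the core mechanism concrete, below is a minimal, self-contained sketch (not the paper's implementation) of Annealed Importance Sampling targeting a product of experts, using toy 1-D Gaussian experts in place of the paper's visual and physical priors. All names and parameters (`expert_a_logpdf`, `n_anneal`, `n_mh`, step sizes) are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "experts": each contributes a log-density over the same variable x.
# In the paper's setting these would be heterogeneous priors (generative
# models, vision-language models, simulators); here they are 1-D Gaussians.
def expert_a_logpdf(x):
    return -0.5 * ((x - 2.0) / 0.8) ** 2

def expert_b_logpdf(x):
    return -0.5 * ((x + 1.0) / 1.2) ** 2

def target_logpdf(x):
    # Product of experts: densities multiply, so log-densities add.
    return expert_a_logpdf(x) + expert_b_logpdf(x)

def base_logpdf(x):
    # Broad, easy-to-sample base distribution (annealing start point).
    return -0.5 * (x / 5.0) ** 2

def ais_sample(n_particles=2000, n_anneal=50, n_mh=5, step=0.5):
    """Annealed Importance Sampling from the base to the product distribution."""
    betas = np.linspace(0.0, 1.0, n_anneal + 1)
    x = rng.normal(0.0, 5.0, size=n_particles)   # exact draws from the base
    log_w = np.zeros(n_particles)                # accumulated log-weights

    def anneal_logpdf(x, beta):
        # Geometric bridge: (1 - beta) * base + beta * target, in log space.
        return (1.0 - beta) * base_logpdf(x) + beta * target_logpdf(x)

    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # Importance-weight update for moving to the next intermediate density.
        log_w += anneal_logpdf(x, b_next) - anneal_logpdf(x, b_prev)
        # A few Metropolis-Hastings steps targeting the new density.
        for _ in range(n_mh):
            prop = x + step * rng.normal(size=n_particles)
            log_acc = anneal_logpdf(prop, b_next) - anneal_logpdf(x, b_next)
            accept = np.log(rng.uniform(size=n_particles)) < log_acc
            x = np.where(accept, prop, x)

    return x, log_w

samples, log_w = ais_sample()
w = np.exp(log_w - log_w.max())
# Product of N(2, 0.8^2) and N(-1, 1.2^2) has precision-weighted mean ~1.08.
print(f"weighted posterior mean ~ {np.sum(w * samples) / np.sum(w):.3f}")
```

In the actual framework the experts are high-dimensional visual and physical models and the transitions operate in a generative model's sampling space; this toy only illustrates the anneal-and-reweight structure that lets AIS draw samples no single expert could produce alone.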