🤖 AI Summary
This work addresses local infeasibility and global inconsistency in compositional generative models: when local distributions are multimodal, existing composition methods average incompatible modes. To overcome this, the authors propose Compositional Diffusion with Guided Search (CDGS), a paradigm that embeds guided search directly within the diffusion denoising process. By combining population-based sampling, likelihood-based filtering of infeasible candidates, and iterative resampling of overlapping segments as a local-to-global message-passing mechanism, the method avoids mode averaging and yields plans that are both locally feasible and globally coherent over long horizons. CDGS matches oracle performance on seven robotic manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data, and generalizes to text-guided panoramic image and long video generation, demonstrating cross-domain coherence and feasibility.
📝 Abstract
Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this mode averaging problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/
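To make the mode-averaging problem and the search-in-denoising idea concrete, here is a minimal toy sketch (not the authors' implementation; all function names and the Gaussian-mixture stand-in for a learned local diffusion model are illustrative assumptions). A population of candidates is refined through a denoising schedule; at each step, low-likelihood candidates are pruned and the population is resampled from survivors, so final samples commit to discrete modes instead of collapsing to their infeasible average.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(x, mode_centers):
    # Toy log-likelihood under a mixture of unit Gaussians
    # (stand-in for a learned local generative model).
    d = np.linalg.norm(x[:, None, :] - mode_centers[None, :, :], axis=-1)
    return np.log(np.exp(-0.5 * d**2).sum(axis=1) + 1e-12)

def denoise_step(x, mode_centers, t):
    # Toy denoiser: pull each candidate toward its NEAREST mode,
    # rather than toward the average of all modes.
    d = np.linalg.norm(x[:, None, :] - mode_centers[None, :, :], axis=-1)
    nearest = mode_centers[d.argmin(axis=1)]
    return x + 0.3 * (nearest - x) + 0.1 * np.sqrt(t) * rng.standard_normal(x.shape)

def guided_search_denoise(mode_centers, pop=64, steps=20, keep=0.5):
    # Population-based sampling: maintain many candidate plans at once.
    x = rng.standard_normal((pop, 2)) * 3.0
    for step in range(steps):
        t = 1.0 - step / steps  # noise level decays over the schedule
        x = denoise_step(x, mode_centers, t)
        # Likelihood-based filtering: prune the low-scoring half...
        scores = log_likelihood(x, mode_centers)
        survivors = x[np.argsort(-scores)[: int(pop * keep)]]
        # ...then iteratively resample the population from the survivors.
        idx = rng.integers(0, len(survivors), size=pop)
        x = survivors[idx] + 0.05 * rng.standard_normal((pop, 2))
    return x

# Two incompatible modes; naive composition would average them to (0, 0).
modes = np.array([[2.0, 2.0], [-2.0, -2.0]])
final = guided_search_denoise(modes)
# Distance from each final candidate to its nearest mode.
nearest_dist = np.linalg.norm(
    final[:, None, :] - modes[None, :, :], axis=-1
).min(axis=1)
```

In this toy setting, every surviving candidate ends up near one of the two modes rather than at their averaged midpoint. The full method additionally couples overlapping trajectory segments, resampling each against its neighbors so that locally chosen modes remain globally consistent, which this single-segment sketch omits.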