🤖 AI Summary
This work addresses semantic decomposition in visual scenes: recovering the constituent factors from a bound, composite representation. It frames decomposition as an inverse problem and proposes a coupled inference framework in diffusion models, in which a reconstruction-driven guidance term couples the diffusion processes and encourages the composition of the estimated factors to match the original bound vector. An iterative sampling scheme is introduced to further improve decomposition performance. The authors also show that attention-based resonator networks are a special case of the proposed framework. In experiments on a range of synthetic semantic decomposition tasks, the method outperforms resonator networks.
📝 Abstract
Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
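To make the decomposition problem concrete, the following is a minimal sketch of the resonator-style baseline that the abstract compares against, not the proposed diffusion method. It assumes bipolar (+1/-1) vectors bound by elementwise product, which is one common choice of binding operation; the abstract does not specify one. All names (`random_codebook`, `bind`, `cleanup`, `resonator`) and the sizes used are hypothetical, and the factor estimates are updated one after another, which is one of several possible update schedules.

```python
import random

def random_codebook(rng, size, dim):
    """Return `size` random bipolar (+1/-1) vectors of length `dim`."""
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(size)]

def bind(*vectors):
    """Bind by elementwise product; for bipolar vectors this is its own inverse."""
    out = [1] * len(vectors[0])
    for v in vectors:
        out = [a * b for a, b in zip(out, v)]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cleanup(codebook, query):
    """Project `query` onto the codebook and threshold: sign(X X^T q)."""
    sims = [dot(c, query) for c in codebook]
    mixed = [sum(s * c[i] for s, c in zip(sims, codebook))
             for i in range(len(query))]
    return [1 if m >= 0 else -1 for m in mixed]

def resonator(bound, codebooks, max_iters=50):
    """Estimate one codebook index per factor from the bound vector."""
    # Assumed initialization: sign of the superposition of each codebook.
    estimates = [[1 if sum(col) >= 0 else -1 for col in zip(*cb)]
                 for cb in codebooks]
    for _ in range(max_iters):
        previous = [e[:] for e in estimates]
        for f, cb in enumerate(codebooks):
            # Unbind the current estimates of all other factors, then clean up.
            others = bind(*(estimates[g] for g in range(len(codebooks)) if g != f))
            estimates[f] = cleanup(cb, bind(bound, others))
        if estimates == previous:  # fixed point reached
            break
    # Report the closest codebook entry for each factor.
    return [max(range(len(cb)), key=lambda k: dot(cb[k], est))
            for cb, est in zip(codebooks, estimates)]

# Illustrative sizes only: two factors, five candidates each, dimension 1024.
rng = random.Random(0)
codebooks = [random_codebook(rng, 5, 1024) for _ in range(2)]
truth = [3, 1]
bound = bind(*(cb[i] for cb, i in zip(codebooks, truth)))
recovered = resonator(bound, codebooks)
print(recovered == truth)
```

The coupling is visible in the inner loop: each factor's estimate is refined using the current estimates of all the others. The framework described in the abstract plays a similar role with a reconstruction-driven guidance term inside coupled diffusion processes; that term is not implemented here.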